the pure MXS dotnet version is not bad either:
(
    t0 = timestamp()
    h0 = heapfree
    matches = (dotNetClass "System.Text.RegularExpressions.Regex").Matches strToCheck "kappa"
    k = for i = 0 to matches.count - 1 collect (matches.item i).index
    format " count:% == time:% heap:%\n" k.count (timestamp() - t0) (h0 - heapfree)
)
the well-known issue kills performance: iterating over dotnet collections.
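For reference, the same match-index collection can be done with Python's re module, which avoids iterating a .NET collection element by element; a minimal sketch (the sample text is an assumption, the thread's strToCheck is not shown):

```python
import re

# sample text (assumption)
strToCheck = "kappa one\ntwo kappa\nkappa"
# start index of every occurrence of "kappa"
k = [m.start() for m in re.finditer("kappa", strToCheck)]
```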
If anyone knows how to make a regex find "--" (or any other number of "-") at the beginning of a line, please help.
I will try to find the answer tomorrow evening. Now it is time to go to bed.
Thank you one more time.
the regex syntax is pretty simple, but I'd still suggest reading a good book about writing efficient expressions. It will make your life much easier many times over in the future.
^ -- beginning of the line
(dotNetClass "system.text.regularexpressions.regex").isMatch "----" "^[-]+" -- true
(dotNetClass "system.text.regularexpressions.regex").isMatch " ----" "^[-]+" -- false
(dotNetClass "system.text.regularexpressions.regex").isMatch " ----" "[-]+" -- true
(dotNetClass "system.text.regularexpressions.regex").isMatch " ----" "-+" -- true
(dotNetClass "system.text.regularexpressions.regex").isMatch " +-+-" "-[-+]+" -- true
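The same truth table holds in Python's re module, and re.MULTILINE is the usual way to make "^" anchor at the start of every line rather than only at the start of the whole string; a sketch:

```python
import re

# "^" anchors the pattern at the start of the string
starts_with_dashes = re.search(r"^[-]+", "----") is not None    # True
indented = re.search(r"^[-]+", " ----") is not None             # False: "^" sits before the space
anywhere = re.search(r"[-]+", " ----") is not None              # True: no anchor
# re.MULTILINE makes "^" match at the start of every line
per_line = re.search(r"^[-]+", "text\n----", re.MULTILINE) is not None  # True
```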
as I said above, it’s very easy to get… just find ends of all lines first
hmm… but then, for every match, it requires a linear search for the first end-of-line offset larger than the match offset, just to know the line number.
did you mean something like this?
(
    ss = strToCheck as StringStream
    seek ss 0
    ends_of_lines_offsets = #()
    while not eof ss do
    (
        if (skipToString ss "\n") != undefined do append ends_of_lines_offsets (filePos ss)
    )
    seek ss 0
    skipToString ss "kappa omicron"
    offset = filePos ss
    line_number = for i = 1 to ends_of_lines_offsets.count where ends_of_lines_offsets[i] > offset do exit with i
    seek ss 0
    for i = 1 to line_number - 1 do skipToNextLine ss
    readLine ss
)
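The linear scan per match can also be replaced by a binary search: precompute the end-of-line offsets once, then bisect them for each match offset. A sketch in plain Python (sample text is an assumption):

```python
import bisect
import re

# sample text (assumption)
text = "alpha\nbeta kappa omicron\ngamma\n"
# precompute every end-of-line offset once
line_ends = [m.start() for m in re.finditer("\n", text)]
offset = text.find("kappa omicron")
# the number of line ends before the offset is the zero-based line index; +1 for 1-based
line_number = bisect.bisect_right(line_ends, offset) + 1
```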
something like this:
(
    re = python.import "re"
    t0 = timestamp()
    h0 = heapfree
    r = re.finditer "\n" strToCheck
    k = re.finditer "kappa" strToCheck
    y = 0
    s = 0
    i = (r.next()).start()
    pos = #()
    while (x = try ((k.next()).start()) catch()) != undefined do
    (
        while x > i do
        (
            s = i
            y += 1
            i = (r.next()).start()
        )
        append pos [x, y, x - s]
    )
    format " PY count:% == time:% heap:%\n" pos.count (timestamp() - t0) (h0 - heapfree)
    pos
)
I don’t know how to suppress the “StopIteration” exception yet… it needs some wrapping in a Python try/except.
ok… found…
next(iterator, default)
Retrieve the next item from the iterator by calling its next() method. If default is given, it is returned if the
iterator is exhausted, otherwise StopIteration is raised.
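That two-argument form is exactly what the loop needs; a quick illustration:

```python
it = iter([10, 20])
first = next(it)        # 10
second = next(it)       # 20
# iterator exhausted: the two-argument form returns the default
tail = next(it, None)   # None instead of raising StopIteration
```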
so finally:
(
    re = python.import "re"
    next = (python.import "__builtin__").next
    t0 = timestamp()
    h0 = heapfree
    r = re.finditer "\n" strToCheck
    k = re.finditer "kappa" strToCheck
    i = (next r).start()
    y = 0
    s = 0
    pos = #()
    while (x = next k undefined) != undefined do
    (
        x = x.start()
        while x > i do
        (
            y += 1
            s = i
            i = (next r).start()
        )
        append pos [x, y, x - s]
    )
    format " PY count:% == time:% heap:%\n" pos.count (timestamp() - t0) (h0 - heapfree)
    pos
)
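The same offset/line/column merge loop can be checked against a plain-Python reference; the sample text and the guard for text after the last newline are assumptions:

```python
import re

strToCheck = "one kappa\nkappa two\nxx kappa"  # sample text (assumption)
r = re.finditer("\n", strToCheck)
k = re.finditer("kappa", strToCheck)

nl = next(r, None)
i = nl.start() if nl else len(strToCheck)  # offset of the next newline
y, s = 0, 0                                # current line index, offset of its preceding newline
pos = []
for m in k:
    x = m.start()
    while x > i:                           # advance the newline iterator past this match
        y += 1
        s = i
        nl = next(r, None)
        i = nl.start() if nl else len(strToCheck)
    pos.append((x, y, x - s))              # (match offset, line index, offset within line)
```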
ok, I see.
python is an order of magnitude slower for this expression: @"(^|\n|\r)\s*kappa", which matches lines starting with kappa.
I guess we have to write plain python code and execute it with python.execute to make it efficient. But how do we get the data back to MXS?
python:
 PY count:1896 == time:1.784 sec. heap:15298128L
c#:
 Find words 0.172 sec.
 1896
With python, going through all 50000 lines of a text file and finding all lines that start with “--”, skipping leading whitespace, takes 0.086 sec and finds 4422 lines. I am pretty sure the code can be optimized:
import time

txtPath = r"C:\SampleText.txt"
dashStr = "--"
dashArr = []
dashLineNumArr = []

st = time.time()
with open(txtPath, "r") as file:
    for num, line in enumerate(file, 1):
        if len(line) != 0:
            stringArr = line.split()
            if stringArr:
                first_word = stringArr[0]
                f1 = first_word.rstrip('\n')
                f2 = f1.rstrip('\t')
                str1 = (f2.lstrip('\t')).lower()
                if dashStr == str1:
                    dashArr.append(line)
                    dashLineNumArr.append(num)
et = time.time()
# get the execution time
elapsed_time = et - st
print('Execution time:', elapsed_time, 'seconds')
print("dash lines:", len(dashArr))
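The stripping in the inner loop can be reduced to one split: the first whitespace-delimited token from line.split() already has no '\n', '\t' or spaces around it. A sketch on an in-memory list (the sample lines are assumptions):

```python
# sample lines (assumption)
lines = ["-- comment", "   -- indented", "\t--\t", "code()", "--x trailing"]
dashLineNumArr = []
for num, line in enumerate(lines, 1):
    tokens = line.split()              # split() already strips \n, \t and spaces
    if tokens and tokens[0].lower() == "--":
        dashLineNumArr.append(num)
```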
This works:
(
    global PYTHON_RETURN

    fn CollectOldFiles dir threshold_days: =
    (
        local pyCmd = StringStream ""
        format "
import os, time
def collect_old_files(file_dir, threshold_days=None):
    if threshold_days == None:
        threshold_days = 0
    threshold_time = (time.time()) - (60 * 60 * 24 * threshold_days)
    file_list = [os.path.join(file_dir, f) for f in os.listdir(file_dir) if '.ps1' in f and (os.path.getmtime(os.path.join(file_dir, f)) <= threshold_time)]
    return file_list
files = collect_old_files(r'%', %)
arr = '#({0})'.format(','.join([str('@\"'+str(n)+'\"') for n in files]))
MaxPlus.Core.EvalMAXScript('PYTHON_RETURN = {0}'.format(arr))
" (TrimRight dir "\\") threshold_days to:pyCmd
        python.execute (pyCmd as string)
        ::PYTHON_RETURN
    )

    old_files = CollectOldFiles @"C:\M1\" threshold_days:1
    format "old_files: %\n" old_files[1]
)
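The Python side of that round-trip can be exercised on its own, without the MXS wrapper; a sketch using a temp directory (the .ps1 filter matches the snippet above, file names are assumptions):

```python
import os
import tempfile
import time

def collect_old_files(file_dir, threshold_days=0):
    # keep .ps1 files whose modification time is at least threshold_days old
    threshold_time = time.time() - 60 * 60 * 24 * threshold_days
    return [os.path.join(file_dir, f) for f in os.listdir(file_dir)
            if '.ps1' in f and os.path.getmtime(os.path.join(file_dir, f)) <= threshold_time]

d = tempfile.mkdtemp()
open(os.path.join(d, "a.ps1"), "w").close()
open(os.path.join(d, "b.txt"), "w").close()
found = collect_old_files(d)           # threshold 0: any existing file qualifies
```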