[Closed] Fast search through text file

Page 5 / 10 Prev Next

denisT

Jan 16, 2023 6:36 am

I don’t see any reason to use Python… I’m pretty sure C# (dotnet) can do it just as fast.

3 Replies

miauu

Jan 16, 2023 6:36 am

Reply to

denisT

Denis, I can’t send you the text file, sorry, but the string generated by the maxscript code should be enough for performance testing. Serejah is using seed 12345; I used seed 123, but since last evening I have also been using 12345, so the generated string should be the same.

Here is a text file, generated by the script(I have added —– in the wordsArr): https://drive.google.com/file/d/1CIzUFv_p_0ae9PX49AcuA-wcJu5W0AwS/view?usp=share_link

The task is to find lines that start with “kappa” or “KaPpA” (or any other combination of upper- and lowercase letters), regardless of any whitespace before the “kappa”.

When I saw the speed of the pure maxscript, I decided to find another solution. Since I can use Python but not C#, I tried Python, and it proved to be much, much faster than maxscript. But then the problem of executing the Python code inside maxscript and getting the data back arose.
As I said, executing Python inside maxscript is something that I want to learn.
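For reference, a minimal Python sketch of the task as stated above: collect the 1-based numbers of the lines whose first word is “kappa” in any letter case, ignoring leading spaces and tabs. The function name `find_first_word_lines` and the sample text are made up for illustration and are not taken from the scripts in this thread.

```python
def find_first_word_lines(text, word):
    """Return 1-based numbers of lines whose first word equals `word`, ignoring case."""
    target = word.lower()
    hits = []
    for num, line in enumerate(text.splitlines(), 1):
        # split() with no separator skips leading whitespace and drops empty lines
        parts = line.split(None, 1)
        if parts and parts[0].lower() == target:
            hits.append(num)
    return hits


# hypothetical sample: lines 1, 2 and 6 start with the word, line 4 starts with "kappaz"
sample = "kappa one\n  KaPpA two\nalpha kappa\n\tkappaz three\n\n \tKAPPA"
print(find_first_word_lines(sample, "kappa"))  # [1, 2, 6]
```

Comparing the whole first word rather than a prefix means that a line starting with “kappaz” is not reported.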

denisT

Jan 16, 2023 6:36 am

Reply to

miauu

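(
	t0 = timestamp()
	h0 = heapfree

	-- split the source string into lines
	ss = filterstring strToCheck "\n"
	pt = "kappa*"
	-- trim leading spaces/tabs, then wildcard-match (matchpattern ignores case by default)
	ii = for k=1 to ss.count where matchpattern (trimleft ss[k] " \t") pattern:pt collect k

	format "count:% time:% heap:%\n" ii.count (timestamp() - t0) (h0 - heapfree)
)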

it’s very fast… I don’t see the reason to get it faster

miauu

Jan 16, 2023 6:36 am

Reply to

denisT

It is the same as this one: Fast search through text file

What I get when searching only for kappa:
time:76 heap:9919600L
kappaArr: 338

denisT

Jan 16, 2023 6:36 am

could you post (or send me) the file you are using for the test? (SampleText.txt)
to measure performance we have to test the same source

denisT

Jan 16, 2023 6:36 am

oops… I seem to have missed the point of the task. Do we only need to find lines that start with “kappa”?
Once again, what do we need?

Serejah

Jan 16, 2023 6:36 am

as for the source string, I just set the random seed to 12345 to make sure it is always the same, and used that for testing
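The idea of pinning the seed can be sketched in Python as follows. This is only an illustration of the principle: the word list, the line layout, and the function name `make_test_text` are assumptions, and Python's generator will not reproduce the string that the maxscript generator builds from the same seed.

```python
import random

WORDS = ["alpha", "beta", "kappa", "omicron", "sigma"]  # hypothetical word list


def make_test_text(line_count, seed=12345):
    """Build a reproducible block of text: the same seed gives the same lines."""
    rng = random.Random(seed)  # private generator, unaffected by other code
    lines = []
    for _ in range(line_count):
        indent = " " * rng.randint(0, 3)
        words = [rng.choice(WORDS) for _ in range(rng.randint(1, 4))]
        lines.append(indent + " ".join(words))
    return "\n".join(lines)


a = make_test_text(1000)
b = make_test_text(1000)
print(a == b)  # True: both calls used the same seed
```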

denisT

Jan 16, 2023 6:36 am

so is it not as simple as:

(
	t0 = timestamp()
	h0 = heapfree

	ss = filterstring strToCheck "\n"
	-- as written, this pattern is case-sensitive and allows no whitespace before the word
	rx = dotnetobject "System.Text.RegularExpressions.Regex" "^(kappa|omicron)"
	ii = for k=1 to ss.count where rx.IsMatch ss[k] collect k

	format "count:% time:% heap:%\n" ii.count (timestamp() - t0) (h0 - heapfree)
	ii
)

Serejah

Jan 16, 2023 6:36 am

the only trouble I see is that not every match is necessarily valid: the “kappa*” pattern matches “kappaz” just as it matches “kappa ”. That’s why I dismissed startswith and switched to regex.
Of course, an extra check for a whitespace character could be added for every candidate, but I bet regex will be faster
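To make the difference concrete, here is a small Python sketch that compares a plain prefix test with a regex that requires whitespace or the end of the line after the word. The sample lines and the exact pattern are assumptions chosen for illustration, not the regex used in the thread.

```python
import re

# hypothetical sample lines
lines = ["kappa 1", "  KaPpA\t2", "kappaz 3", "kappa", "xkappa 4"]

# Prefix test, comparable in spirit to matching "kappa*" after trimming the left side.
prefix_hits = [s for s in lines if s.lstrip(" \t").lower().startswith("kappa")]

# Regex: optional leading blanks, the word, then whitespace or the end of the line.
rx = re.compile(r"^[ \t]*kappa(?=\s|$)", re.IGNORECASE)
regex_hits = [s for s in lines if rx.match(s)]

print(prefix_hits)  # includes "kappaz 3"
print(regex_hits)   # does not include "kappaz 3"
```

The lookahead `(?=\s|$)` does the job of the extra whitespace check without a second pass over the candidates.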

denisT

Jan 16, 2023 6:36 am

for “kappa*” I get:
count:1866 time:54 heap:10019252L

do you need it for many patterns?

miauu

Jan 16, 2023 6:36 am

I have:
time:88 heap:10017612L
kappaArr: 1860 – this difference is because I use the dynamically generated string.

Yes, it has to find all the words if needed: for each word, collect the line number and the text on that line.
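One way to collect this for several words in a single pass is a dictionary keyed by the word, sketched below in Python. The three-word list, the sample text, and the names `WORDS` and `collect` are hypothetical; the pattern assumes that a word must be followed by whitespace or the end of the line.

```python
import re
from collections import defaultdict

WORDS = ["kappa", "omicron", "mu"]  # hypothetical subset of the word list

# One compiled alternation for all words; re.escape guards against special characters.
alternation = "|".join(re.escape(w) for w in WORDS)
rx = re.compile(r"^[ \t]*(" + alternation + r")(?=\s|$)", re.IGNORECASE)


def collect(text):
    """Map each word to a list of (line number, line text) pairs."""
    found = defaultdict(list)
    for num, line in enumerate(text.splitlines(), 1):
        m = rx.match(line)
        if m:
            found[m.group(1).lower()].append((num, line))
    return found


sample = "Kappa a\nmu b\n  OMICRON c\nmuon d\nkappa"
result = collect(sample)
for word, hits in result.items():
    print(word, hits)
```

With this layout, adding another word only means extending the list; no extra arrays or branches are needed.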

denisT

Jan 16, 2023 6:36 am

but it’s faster than your Python example!

miauu

Jan 16, 2023 6:36 am

I have to check it against this:


import os
import sys
import time

# raw string, so the backslashes in the Windows path are not read as escape sequences
txtFile = r"H:\E_Desktop\M1\50000LinesOfText.txt"

_alpha = "alpha"
_beta = "beta" 
_gama = "gama"
_delta = "delta" 
_Epsilon = "epsilon" 
_Zeta = "zeta" 
_Eta = "eta" 
_Theta = "theta"
_Iota = "iota"
_kaPPa = "kappa"
_LamBda = "lambda"
_mU = "mu"
_Nu = "nu" 
_xi = "xi"
_omicron = "omicron"
_pi = "pi"
_rHo = "rho" 
_siGma = "sigma"
_Tau = "tau"
_UpSiLoN = "upsilon"
_pHi = "phi"
_chi = "chi"
_psi = "psi"
_omega = "omega"

tokensArr = [_alpha, _beta, _gama, _delta, _Epsilon, _Zeta, _Eta, _Theta, _Iota, _kaPPa, _mU, _Nu, _xi, _omicron, _pi, _siGma, _Tau, _UpSiLoN, _pHi, _chi, _psi, _omega,   _LamBda,   _rHo,   ]


# define the arrays to store the data
defalphaArr = []
defbetaArr = []
defgamaArr = []
defdeltaArr = []
defEpsilonArr = []
defZetaArr = []
defEtaArr = []
defThetaArr = []
defIotaArr = []
defkaPPaArr = []
defmUArr = []
deNuArr = []
defxiArr = []
defomicronArr = []
defpiArr = []
defrHoArr = []
defsiGmaArr = []
defTauArr = []
defUpSiLoNArr = []
defpHiArr = []
defchiArr = []
defpsiArr = []
defomegaArr = []
defLamBdaArr = []

defalphaLineNumArr = []
defbetaLineNumArr = []
defgamaLineNumArr = []
defdeltaLineNumArr = []
defEpsilonLineNumArr = []
defZetaLineNumArr = []
defEtaLineNumArr = []
defThetaLineNumArr = []
defIotaLineNumArr = []
defkaPPaLineNumArr = []
defmULineNumArr = []
deNuLineNumArr = []
defxiLineNumArr = []
defomicronLineNumArr = []
defpiLineNumArr = []
defrHoLineNumArr = []
defsiGmaLineNumArr = []
defTauLineNumArr = []
defUpSiLoNLineNumArr = []
defpHiLineNumArr = []
defchiLineNumArr = []
defpsiLineNumArr = []
defomegaLineNumArr = []
defLamBdaLineNumArr = []


spaceStr = " "

st = time.time()

with open(txtFile, "r") as file:
    for num, line in enumerate(file, 1):
        if len(line) != 0:
            stringArr = line.split()
            if stringArr:
                first_word = stringArr[0]
                f1 = first_word.rstrip('\t\n')
                str1 = (f1.lstrip('\t')).lower()
                for token in tokensArr:
                    if token == str1:
                        if str1 == _alpha:
                            defalphaArr.append(first_word + spaceStr + stringArr[1])
                            defalphaLineNumArr.append(num)
                            break
                        elif str1 == _beta:
                            defbetaArr.append(first_word + spaceStr + stringArr[1])
                            defbetaLineNumArr.append(num)
                            break
                        elif str1 == _gama:
                            defgamaArr.append(first_word + spaceStr + stringArr[1])
                            defgamaLineNumArr.append(num)
                            break
                        elif str1 == _delta:
                            defdeltaArr.append(first_word + spaceStr + stringArr[1])
                            defdeltaLineNumArr.append(num)
                            break
                        elif str1 == _Epsilon:
                            defEpsilonArr.append(first_word + spaceStr + stringArr[1])
                            defEpsilonLineNumArr.append(num)
                            break
                        elif str1 == _Zeta:
                            defZetaArr.append(first_word + spaceStr + stringArr[1])
                            defZetaLineNumArr.append(num)
                            break
                        elif str1 == _Eta:
                            defEtaArr.append(first_word + spaceStr + stringArr[1])
                            defEtaLineNumArr.append(num)
                            break
                        elif str1 == _Theta:
                            defThetaArr.append(first_word + spaceStr + stringArr[1])
                            defThetaLineNumArr.append(num)
                            break
                        elif str1 == _Iota:
                            defIotaArr.append(line)
                            defIotaLineNumArr.append(num)
                            break
                        elif str1 == _kaPPa:
                            defkaPPaArr.append(line)
                            defkaPPaLineNumArr.append(num)
                            break
                        elif str1 == _LamBda:
                            defLamBdaArr.append(line)
                            defLamBdaLineNumArr.append(num)
                            break
                        elif str1 == _mU:
                            defmUArr.append(first_word + spaceStr + stringArr[1])
                            defmULineNumArr.append(num)
                            break
                        elif str1 == _Nu:
                            deNuArr.append(line)
                            deNuLineNumArr.append(num)
                            break
                        elif str1 == _xi:
                            defxiArr.append(first_word + spaceStr + stringArr[1])
                            defxiLineNumArr.append(num)
                            break
                        elif str1 == _omicron:
                            defomicronArr.append(first_word + spaceStr + stringArr[1])
                            defomicronLineNumArr.append(num)
                            break
                        elif str1 == _pi:
                            defpiArr.append(first_word + spaceStr + stringArr[1])
                            defpiLineNumArr.append(num)
                            break
                        elif str1 == _rHo:
                            defrHoArr.append(first_word + spaceStr + stringArr[1])
                            defrHoLineNumArr.append(num)
                            break
                        elif str1 == _siGma:
                            defsiGmaArr.append(line)
                            defsiGmaLineNumArr.append(num)
                            break
                        elif str1 == _Tau:
                            defTauArr.append(line)
                            defTauLineNumArr.append(num)
                            break
                        elif str1 == _UpSiLoN:
                            defUpSiLoNArr.append(line)
                            defUpSiLoNLineNumArr.append(num)
                            break
                        elif str1 == _pHi:
                            defpHiArr.append(line)
                            defpHiLineNumArr.append(num)
                            break
                        elif str1 == _chi:
                            defchiArr.append(first_word + spaceStr + stringArr[1])
                            defchiLineNumArr.append(num)
                            break
                        elif str1 == _psi:
                            defpsiArr.append(line)
                            defpsiLineNumArr.append(num)
                            break
                        elif str1 == _omega:
                            defomegaArr.append(first_word + spaceStr + stringArr[1])
                            defomegaLineNumArr.append(num)
                            break
                        

        
et = time.time()

# get the execution time
elapsed_time = et - st
print('Execution time:', elapsed_time, 'seconds')


print ("defalphaArr", len(defalphaArr))
print ("defbetaArr", len(defbetaArr))
print ("defgamaArr", len(defgamaArr))
print ("defdeltaArr", len(defdeltaArr))
print ("defEpsilonArr", len(defEpsilonArr))
print ("defZetaArr", len(defZetaArr))
print ("defEtaArr", len(defEtaArr))
print ("defThetaArr", len(defThetaArr))
print ("defIotaArr", len(defIotaArr))
print ("defkaPPaArr", len(defkaPPaArr))
print ("defmUArr", len(defmUArr))
print ("defpHiArr", len(defpHiArr))
print ("deNuArr", len(deNuArr))
print ("defxiArr", len(defxiArr))
print ("defomicronArr", len(defomicronArr))
print ("defpiArr", len(defpiArr))
print ("defsiGmaArr", len(defsiGmaArr))
print ("defTauArr", len(defTauArr))
print ("defUpSiLoNArr", len(defUpSiLoNArr))
print ("defchiArr", len(defchiArr))
print ("defpsiArr", len(defpsiArr))
print ("defomegaArr", len(defomegaArr))
print ("defLamBdaArr", len(defLamBdaArr))
print ("defrHoArr", len(defrHoArr))


print ("defalphaArr", defalphaArr[0])

Result:

Execution time: 0.186000108719 seconds
defalphaArr 1852
defbetaArr 1856
defgamaArr 1896
defdeltaArr 1877
defEpsilonArr 1800
defZetaArr 1844
defEtaArr 1779
defThetaArr 1751
defIotaArr 1846
defkaPPaArr 1837
defmUArr 1855
defpHiArr 1790
deNuArr 1817
defxiArr 1840
defomicronArr 1859
defpiArr 1863
defsiGmaArr 1931
defTauArr 1865
defUpSiLoNArr 1948
defchiArr 1794
defpsiArr 1807
defomegaArr 1806
defLamBdaArr 1776
defrHoArr 1754
defalphaArr alpha delta
