Regex searches on text files

Paul R1

14 discussion posts

Keith,

I've observed another behaviour which I'd like to bring to your attention regarding regex searches. I was finding that my searches were apparently just hanging. After some investigation, what I've found is that it's not a hang but rather a search time that grows far faster than the file size when carrying out a regex search on a text file. I found it on a very large CSV file, but the same is true for text files as they use the same handler (query.dll).

Actually, it seems to be because I'm using a positive lookahead, so it may just be a fact of life, but I tried doing a search on a text file that I kept doubling in size, with the following results:

test 1 - 20 lines/7286 characters - positive lookahead search takes 4.815 seconds
test 2 - 40 lines/14572 characters - positive lookahead search takes 18.884 seconds
test 3 - 80 lines/29144 characters - positive lookahead search takes 75.012 seconds

So typically a doubling of the file size quadruples the search time. Note that if I do a simple regex search on the 80 line file the time is 0.14 seconds, compared to 0.13 seconds using a non-regex search, so the positive lookahead really kills it. My problem was that my regex search was hitting a file that was about 400 MB in size, hence the whole thing appeared to hang.

One other interesting thing I observed was that when I ran such a search, the task manager on my laptop showed that FileSeek was consuming about 25% of my CPU cycles. I simply stopped the search, and the pause and stop buttons greyed out, seemingly indicating that the search had stopped, but FileSeek continued to consume 25% of the CPU. If I then restarted the search, FileSeek started to consume 50% of my CPU. Repeating the cycle jumped it another 25%. So apparently stopping the search doesn't stop the thread doing the work, and if you do this a few times your computer grinds to a halt. However, closing FileSeek completely does resolve the problem.

FYI, I was using a number of positive lookaheads as I was trying to do the regex equivalent of an AND function.

Oct 24, 2014 (modified Oct 24, 2014) • #1

RegEx lookaheads are really intensive, unfortunately. There's not much we can do there in terms of performance. Here's an interesting thread on RegEx performance: http://stackoverflow.com/questions/802003/pb-net-c-regex-match-never-ends (not required reading, I just found it interesting).

As for aborting the search, you're correct, at the moment we wait for it to finish the current file before terminating the thread. This is to avoid leaving a file lock open on the file that's being searched. We can certainly look into whether there's a way to work around that though.
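For illustration only, here is a minimal Python sketch of one way a search worker could stop part-way through a file without leaving the file open. It is not FileSeek's code: `search_file`, `cancel`, and `chunk_lines` are hypothetical names, and the sketch assumes the search runs line by line.

```python
import re
import threading


def search_file(path, pattern, cancel, chunk_lines=1000):
    """Return matching line numbers, or None if the search was cancelled."""
    regex = re.compile(pattern)
    hits = []
    # The with-block closes the file handle even when we return early,
    # so cancelling mid-file does not leave the file open.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            # Poll the cancel flag every chunk_lines lines instead of
            # waiting for the whole file to finish.
            if number % chunk_lines == 0 and cancel.is_set():
                return None
            if regex.search(line):
                hits.append(number)
    return hits


# A UI thread would create the flag, pass it to the worker, and call
# cancel.set() when the user presses Stop.
cancel = threading.Event()
```

The flag is only checked between lines, so a single very slow match on one enormous line would still run to completion before the worker notices the cancellation.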

Could you tell me the RegEx string that you're using? We'll test it out here with a huge file to see if we can improve the search stopping code.

Thanks!

Oct 24, 2014 (modified Oct 24, 2014) • #2

Paul R1

(?=.*\boracle\b)(?=.*\baudit).*

I.e. a regex version of oracle+audit, but using \b to delimit the words in order to prevent matches to other words like plaudits (actually audit*, hence no \b after audit).
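As a rough sketch of the AND idea in Python (an assumption: Python's `re` module stands in for whatever engine FileSeek uses, and the function names are made up for illustration), the same per-line test can be written either with the stacked lookaheads above or as two independent searches whose results are combined. The sketch does not attempt to reproduce FileSeek's timings.

```python
import re

# The lookahead form from the post above.
LOOKAHEAD = re.compile(r"(?=.*\boracle\b)(?=.*\baudit).*")

# The same two conditions as separate patterns.
ORACLE = re.compile(r"\boracle\b")
AUDIT = re.compile(r"\baudit")


def line_matches_lookahead(line):
    """True if the stacked-lookahead pattern matches somewhere in the line."""
    return LOOKAHEAD.search(line) is not None


def line_matches_two_pass(line):
    """True if both words are found by two independent searches."""
    return ORACLE.search(line) is not None and AUDIT.search(line) is not None


for sample in ["oracle audit log", "plaudits for oracle", "auditing the oracle db"]:
    print(sample, line_matches_lookahead(sample), line_matches_two_pass(sample))
```

For single lines the two functions agree. The two-pass version scans the line at most twice, whereas on a line that does not match, the lookahead version may rescan the rest of the line from every starting position.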

I thought it might be an unavoidable issue but now I know about it I can make some compromises and work around it.

It's probably worth making it clear that such searches can rapidly become a problem, as it looks to the user like the program has simply hung. Also, being able to cancel it would be good, because waiting for a really large text file to complete isn't going to be practical (extrapolating my results suggests my 400 MB file would have taken 600 years to complete, and I'm in a bit more of a hurry than that ☺).
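The extrapolation can be sketched in a few lines of Python. This assumes the search time keeps scaling with the square of the file size (the "doubling quadruples" pattern measured earlier in the thread) and that 400 MB means 400 × 1024 × 1024 characters; `extrapolate_seconds` is a hypothetical helper, not anything from FileSeek.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def extrapolate_seconds(known_chars, known_seconds, target_chars, exponent=2.0):
    """Scale a measured time to a new size, assuming time ~ size ** exponent."""
    return known_seconds * (target_chars / known_chars) ** exponent


# Start from the 80 line measurement: 29144 characters in 75.012 seconds.
target_chars = 400 * 1024 * 1024
years = extrapolate_seconds(29144, 75.012, target_chars) / SECONDS_PER_YEAR
print(f"estimated search time: about {years:.0f} years")
```

Under these assumptions the estimate comes out at several hundred years, the same order of magnitude as the figure quoted above; the exact number depends on which measurement is used as the baseline and how the file size is counted.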

Oct 24, 2014 (modified Oct 24, 2014) • #3

Yeah, I don't think anyone has the patience to wait that long.

You should be able to do that search in FileSeek without RegEx. Could you try disabling the Query is RegEx option, and use the following instead? (including the quotes)

" oracle " +" audit "

The spaces before and after the words in the quotes should make sure that they only match them as their own words.

Oct 27, 2014 • #4

Paul R1

That only works if the words you are looking for are not at the start or end of a line of text.
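A small Python sketch shows the difference. It models the quoted phrase as a plain substring test, which is an assumption about how FileSeek treats it, and both function names are invented for the example.

```python
import re


def padded_literal_hit(line, word):
    """Model of the quoted query: the word with a space on each side."""
    return f" {word} " in line


def word_boundary_hit(line, word):
    """Regex version using \\b, which also matches at line start and end."""
    return re.search(rf"\b{re.escape(word)}\b", line) is not None


for sample in ["the oracle audit", "oracle audit trail", "run the audit"]:
    print(
        sample,
        padded_literal_hit(sample, "oracle"),
        word_boundary_hit(sample, "oracle"),
        padded_literal_hit(sample, "audit"),
        word_boundary_hit(sample, "audit"),
    )
```

The space-padded version misses "oracle" in the second sample and "audit" in the third, because there is no space before the first word or after the last word of a line.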

Anyway I suggest you close this one as I understand the limitation and can work round it.

Oct 27, 2014 • #5

Ok, no worries! One of our devs checked into this further, and here's his feedback:

"I saw his regex and it looks like he's causing this issue:

http://www.regular-expressions.info/catastrophic.html

His regex is not specific enough. I wrote him a regex that could work for him, but I can only guess because I don't have a sample of his data:

((?<=\boracle\b).*(?=\baudit\b))|((?<=\baudit\b).*(?=\boracle\b))

Or, he could use our searching with this query

("oracle " " oracle " " oracle") +("audit " " audit " " audit")

I tested it on a document with 54038 characters and it did the regex search in 0.156s, and the text query search in 0.145s"

Hope that helps!

Oct 28, 2014 (modified Oct 28, 2014) • #6