Paul R1
14 discussion posts
Keith,
I've observed another behaviour regarding regex searches which I'd like to bring to your attention. My searches were apparently just hanging. After some investigation I found that it's not a hang but rather a search time that grows very steeply (roughly quadratically, going by the timings below) with file size when carrying out a regex search on a text file. I found it on a very large CSV file, but the same is true for text files as they use the same handler (query.dll).
It seems to be because I'm using a positive lookahead, so it may just be a fact of life, but I tried a search on a text file that I kept doubling in size, with the following results:
test 1 - 20 lines/7286 characters - positive lookahead search takes 4.815 seconds
test 2 - 40 lines/14572 characters - positive lookahead search takes 18.884 seconds
test 3 - 80 lines/29144 characters - positive lookahead search takes 75.012 seconds
So typically a doubling of the file size quadruples the search time. Note that if I do a simple regex search on the 80 line file the time is 0.14 seconds, compared to 0.13 seconds using a non-regex search, so the positive lookahead really kills it. My problem was that my regex search was hitting a file that was about 400 MB in size, hence the whole thing appeared to hang.
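As a rough check on the "doubling quadruples" observation, the growth exponent can be estimated from the three timings above. This is only a sketch in Python. The variable names are made up for illustration, and it assumes each test file is exactly twice the size of the previous one.

```python
import math

# Search times in seconds for the 20-, 40- and 80-line files reported above.
timings = [4.815, 18.884, 75.012]

# Each file is assumed to be twice the size of the previous one, so log2 of
# the ratio of consecutive timings estimates k in "time grows like size**k".
exponents = [
    math.log2(later / earlier)
    for earlier, later in zip(timings, timings[1:])
]

for k in exponents:
    print(f"estimated exponent: {k:.2f}")
```

Both estimates come out just under 2, which points to quadratic rather than exponential growth.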
One other interesting thing I observed was that when I ran such a search, Task Manager on my laptop showed FileSeek consuming about 25% of my CPU cycles. When I stopped the search, the pause and stop buttons greyed out, seemingly indicating that the search had stopped, but FileSeek continued to consume 25% of the CPU. If I then restarted the search, FileSeek started to consume 50% of my CPU. Repeating the cycle raised it by another 25%. So apparently stopping the search doesn't stop the thread doing the work, and if you do this a few times your computer grinds to a halt. However, closing FileSeek completely does resolve the problem.
FYI, I was using a number of positive lookaheads as I was trying to do the regex equivalent of an AND function.
Oct 24, 2014 (modified Oct 24, 2014)
•
#1
Paul R1
14 discussion posts
(?=.*\boracle\b)(?=.*\baudit).*
I.e. a regex version of oracle+audit, but using \b to delimit the words in order to prevent matches on other words like plaudits (actually audit*, hence no \b after audit).
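For illustration, here is a minimal sketch of the same AND logic using Python's `re` module. This is not FileSeek's engine, so behaviour and timings may differ. It shows two variants: the pattern above anchored to the start of each line, so the lookaheads are attempted once per line rather than at every character offset, and a plain per-line check with two simple patterns. The sample text and the names `ANCHORED`, `ORACLE`, `AUDIT` and `lines_with_both` are invented for the example.

```python
import re

# The pattern from the post, anchored with ^ and $ in MULTILINE mode so the
# two lookaheads are evaluated once per line.
ANCHORED = re.compile(r"^(?=.*\boracle\b)(?=.*\baudit).*$", re.MULTILINE)

# Alternative: two simple patterns that must both hit on the same line.
ORACLE = re.compile(r"\boracle\b")
AUDIT = re.compile(r"\baudit")


def lines_with_both(text):
    """Return the lines that contain 'oracle' as a whole word and a word starting with 'audit'."""
    return [
        line
        for line in text.splitlines()
        if ORACLE.search(line) and AUDIT.search(line)
    ]


# Hypothetical sample data, one case per line.
SAMPLE = "\n".join(
    [
        "oracle audit trail enabled",    # both words present
        "audits run nightly on oracle",  # 'audits' counts, 'oracle' ends the line
        "plaudits for the oracle team",  # 'plaudits' must not count as 'audit'
        "oracles and audit logs",        # 'oracles' fails the trailing \b
        "nothing relevant here",
    ]
)

print(ANCHORED.findall(SAMPLE))
print(lines_with_both(SAMPLE))
```

Both calls should return the first two sample lines only. The per-line version avoids lookaheads altogether and also handles words at the start or end of a line.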
I thought it might be an unavoidable issue but now I know about it I can make some compromises and work around it.
It's probably worth making it clear that such searches can rapidly become a problem, as it looks to the user like the program has simply hung. Being able to cancel it would also be good, because waiting for a search of a really large text file to complete isn't going to be practical (extrapolating my results suggests my 400 MB file would have taken 600 years to complete, and I'm in a bit more of a hurry than that ☺).
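A back-of-the-envelope version of that extrapolation, sketched in Python, is below. It assumes the search time keeps scaling with the square of the file size, that each character is one byte, and that 400 MB means 400 * 1024**2 bytes. None of these are confirmed, and the constant names are illustrative only.

```python
# Baseline: the 80-line test file (29144 characters, 75.012 seconds).
BASE_CHARS = 29144
BASE_SECONDS = 75.012

# Assumption: 400 MB = 400 * 1024**2 bytes, one byte per character.
TARGET_CHARS = 400 * 1024**2

# Assumption: quadratic scaling, i.e. time is proportional to size squared.
scale = TARGET_CHARS / BASE_CHARS
estimated_seconds = BASE_SECONDS * scale**2

SECONDS_PER_YEAR = 365.25 * 24 * 3600
estimated_years = estimated_seconds / SECONDS_PER_YEAR

print(f"roughly {estimated_years:.0f} years")
```

Under these assumptions the estimate is several centuries, the same order of magnitude as the figure quoted above.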
Oct 24, 2014 (modified Oct 24, 2014)
•
#3
Paul R1
14 discussion posts
That only works if the words you are looking for are not at the start or end of a line of text.
Anyway I suggest you close this one as I understand the limitation and can work round it.
Ok, no worries! One of our devs checked into this further, and here's his feedback:
"I saw his regex and it looks like he's causing this issue:
http://www.regular-expressions.info/catastrophic.html
His regex is not specific enough. I wrote him a regex that could work for him, but I can only guess because I don't have a sample of his data:
((?<=\boracle\b).*(?=\baudit\b))|((?<=\baudit\b).*(?=\boracle\b))
Or he could use our text query search with this query:
("oracle " " oracle " " oracle") +("audit " " audit " " audit")
I tested it on a document with 54038 characters and it did the regex search in 0.156s,
and the text query search in 0.145s"
Hope that helps!
Oct 28, 2014 (modified Oct 28, 2014)
•
#6