shivanshu 08-18-2007, 01:13 PM My current project is with a company that has over 30 thousand database entries. Each entry has a supporting ".htm" file. Until now they just wanted me to open the HTML file for the respective record. But now they want me to allow them to search for a string in all the files...
My head was jammed at first... but now I think there must be some faster algorithm or indexing to cope with searching thousands of files...
What I don't know is:
-How to search in an HTML file
-How to make this search fast for thousands of files
I just want to return True if a file contains the string, like "Hello World",
and add the name of the file to a list box...
My file names are the "File_Id" of each record, e.g. the record with id ASD34V has a support file ASD34V.htm in the folder ./Summary
I am using VB6.
Pls... pls... help...
OnErr0r 08-18-2007, 01:42 PM Since you're searching all files, it sounds as though you could just use the Dir$ function in a loop to enumerate filenames. From there, open the file with an Open statement. Once you've opened the file and read it into a string (Input function), you could simply call the InStr function to search for the substring.
HTML is just text, so it's not terribly hard to search. The only thing that might complicate matters would be discarding searches which match text that is within an HTML tag or possibly script.
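The Dir$ / Open / Input / InStr approach described above might be sketched as follows. This is only an illustration: the names FileContains and SearchFolder, and the ListBox parameter, are hypothetical and do not come from the thread.

```vb
' Sketch only: FileContains and SearchFolder are illustrative names.
Public Function FileContains(ByVal sPath As String, ByVal sFind As String) As Boolean
    Dim iFile As Integer
    Dim sData As String

    iFile = FreeFile
    Open sPath For Binary Access Read As #iFile
    sData = Input$(LOF(iFile), iFile)   ' read the whole file into one string
    Close #iFile

    FileContains = (InStr(1, sData, sFind) > 0)
End Function

Public Sub SearchFolder(ByVal sFolder As String, ByVal sFind As String, lst As ListBox)
    Dim sName As String

    sName = Dir$(sFolder & "\*.htm")    ' first matching file name
    Do While Len(sName) > 0
        If FileContains(sFolder & "\" & sName, sFind) Then lst.AddItem sName
        sName = Dir$                    ' next matching file name
    Loop
End Sub
```

A call such as SearchFolder App.Path & "\Summary", "Hello World", List1 would fill a list box named List1 (an assumed control name) with the matching file names. This version still matches text inside tags.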
shivanshu 08-18-2007, 03:15 PM I've been doing the same thing until now... but the HTML tags were one problem, and searching 20 thousand files was a headache for the user... so I thought of using some kind of algorithm... I'm not even able to tackle the HTML tags... pls help...
OnErr0r 08-18-2007, 05:27 PM I'm not certain I know what you mean by, "headache to the user". Are you talking about a tight loop? (one which does not yield and fails to repaint) If so, then use a DoEvents in your loop. This would also allow you to check a flag for the user to cancel before the entire loop finishes.
Ignoring HTML tags can be handled by first searching for ">", then using that position as a starting point to search for "<". By subtracting the two positions you get the length of the string between the tags. Obviously you would have to subtract one for the true length. You could then InStr on the string (returned by Mid$) and store it if it matches.
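A minimal sketch of that ">" then "<" approach, assuming the whole file is already in a string. The function name TextContains is hypothetical.

```vb
' Returns True if sFind occurs in the text between a ">" and the next "<".
' Sketch only: TextContains is an illustrative name, not code from the thread.
Public Function TextContains(ByRef sData As String, ByRef sFind As String) As Boolean
    Dim lGt As Long, lLt As Long

    Do
        lGt = InStr(lLt + 1, sData, ">")        ' text starts after a ">"
        If lGt = 0 Then Exit Do
        lLt = InStr(lGt + 1, sData, "<")        ' text stops before the next "<"
        If lLt = 0 Then Exit Do
        ' Length of the text is the distance between the positions, minus one
        If InStr(1, Mid$(sData, lGt + 1, lLt - lGt - 1), sFind) > 0 Then
            TextContains = True
            Exit Function
        End If
    Loop
End Function
```

Two limitations of this sketch: a phrase split by inline tags (for example "a <b>good</b> boy") is not found, and the contents of script blocks are treated as ordinary text.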
shivanshu 08-19-2007, 10:47 AM Thanks for solving the HTML tag problem... but the main problem, i.e. the "headache for the user", is his patience... I have to search files 1, 2, 3, 4 ... 33000.
Now if I want to search for "a good boy" I have to loop over every single file from 1 to 33000.
Then for every file, find the HTML tags... then search for "a good boy"; if found, leave that file, return True, and move to the next file...
How much time will this take... 10 or 15 mins... I think the user is gonna break his screen for the 3rd time...
So I want a kind of fast algorithm, something like "Boyer's" algorithm, in place of InStr, which accepts "exceptions", i.e. HTML tags, and is super duper faster than InStr...
Thanks for your response
OnErr0r 08-19-2007, 10:52 AM Boyer-Moore might lead to faster searches or it might not. It really depends on the length of the string for which you are searching, as I describe in this post (http://www.xtremevbtalk.com/showpost.php?p=860712&postcount=19).
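For reference, here is a sketch of Boyer-Moore-Horspool, a simplified relative of Boyer-Moore. It is not the code from the linked post. The name InStrBMH is hypothetical, and the sketch assumes single-byte (ANSI) text so that byte positions equal character positions. Whether it beats the built-in InStr depends on the search string length, so it is worth timing both.

```vb
' Boyer-Moore-Horspool sketch. Returns the 1-based position of sFind
' in sData, or 0 if not found. Case sensitive. Assumes single-byte text.
Public Function InStrBMH(ByRef sData As String, ByRef sFind As String) As Long
    Dim bData() As Byte, bFind() As Byte
    Dim lSkip(0 To 255) As Long
    Dim n As Long, m As Long, i As Long, j As Long

    If Len(sFind) = 0 Or Len(sFind) > Len(sData) Then Exit Function
    bData = StrConv(sData, vbFromUnicode)
    bFind = StrConv(sFind, vbFromUnicode)
    n = UBound(bData) + 1
    m = UBound(bFind) + 1

    ' Default shift is the full pattern length
    For i = 0 To 255: lSkip(i) = m: Next i
    ' Bytes that occur in the pattern (except its last byte) allow a shorter shift
    For i = 0 To m - 2
        lSkip(bFind(i)) = m - 1 - i
    Next i

    i = 0
    Do While i <= n - m
        j = m - 1
        ' Compare the window against the pattern from right to left
        Do While bData(i + j) = bFind(j)
            If j = 0 Then
                InStrBMH = i + 1
                Exit Function
            End If
            j = j - 1
        Loop
        ' Shift by the skip value of the last byte in the current window
        i = i + lSkip(bData(i + m - 1))
    Loop
End Function
```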
You might want to display a progressbar for the user, which tends to give them something to look at and makes the time seem to go by faster. Also, I'd say a cancel button would be a must, since the process might take that long.
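A rough sketch of a search loop with a progress bar, DoEvents, and a cancel flag. It is form-level code and assumes controls with the hypothetical names cmdCancel, lstResults (ListBox) and pbSearch (a ProgressBar from Microsoft Windows Common Controls); none of these names come from the thread.

```vb
' Form-level sketch. cmdCancel, lstResults and pbSearch are assumed controls.
Private mbCancel As Boolean

Private Sub cmdCancel_Click()
    mbCancel = True                      ' the loop below checks this flag
End Sub

Private Sub SearchFiles(sFiles() As String, ByVal sFind As String)
    Dim i As Long, iFile As Integer, sData As String

    mbCancel = False
    pbSearch.Min = 0
    pbSearch.Max = UBound(sFiles) - LBound(sFiles) + 1

    For i = LBound(sFiles) To UBound(sFiles)
        iFile = FreeFile
        Open sFiles(i) For Binary Access Read As #iFile
        sData = Input$(LOF(iFile), iFile)
        Close #iFile

        If InStr(1, sData, sFind, vbTextCompare) > 0 Then lstResults.AddItem sFiles(i)

        pbSearch.Value = i - LBound(sFiles) + 1
        If (i And 31) = 0 Then DoEvents  ' yield every 32 files so the UI can repaint
        If mbCancel Then Exit For        ' user pressed Cancel
    Next i
End Sub
```

Calling DoEvents only every few files, rather than on every file, is a design choice to keep the yielding overhead low; the interval of 32 is arbitrary.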
shivanshu 08-19-2007, 03:01 PM Thanks, thanks a lot for giving this precious information on using Boyer's method. I have already been using a progress bar, with the max value as the number of files and the value as the count of files processed, along with DoEvents.
I used the example provided by you for Boyer's algorithm, but was not able to exclude the html tags...
Pls... pls... help...
OnErr0r 08-19-2007, 03:21 PM "but was not able to exclude the html tags..."
Please ask some specific questions and describe in detail what you did not understand.
mkaras 08-19-2007, 04:18 PM You should evaluate your overall system performance before deciding that the InStr() function is too slow. I suspect that most of the thrash in dealing with 33000 files is due to the disk system and the OS handling of the directory folders and so forth.
If you are using Windows XP or 2K you could try a small experiment. Open Windows Explorer. Then open a search (typically launched via the F3 key from the Windows Explorer window) and set it up to search your 33000 files with the same type of simple string search in the files. Run this and note the time for the search to complete. For a given computer hardware configuration it is very unlikely that you can write a faster search in Visual Basic than the OS performs in the experiment. You may want to perform the search experiment several times to see the speed variation that even Windows will achieve. If you have set up a large amount of virtual cache memory for Windows to work with, many files and/or the directory sectors where the files are stored on disk can be cached in RAM by the OS, thus making subsequent searches even faster.
Michael Karas
Rockoon 08-19-2007, 07:25 PM If a lot of searches will be performed on the same files then there are strategies to eliminate a lot of work at the expense of needing to preparse the files and cache some information (perhaps with a Bloom Hash)
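One way the preparse-and-cache idea could look is a simple word index that maps each word to the files containing it. This sketch is a plain inverted index, not a Bloom filter, and uses a late-bound Scripting.Dictionary. The names IndexText and FilesWithWord are hypothetical, and it assumes file IDs never contain the "|" character and that the text passed in has already had its HTML tags removed.

```vb
' Sketch: word -> "id1|id2|..." index. Assumes file IDs contain no "|".
Private mIndex As Object   ' Scripting.Dictionary, late bound

Public Sub IndexText(ByVal sFileId As String, ByVal sText As String)
    Dim vWords As Variant, i As Long, sWord As String

    If mIndex Is Nothing Then
        Set mIndex = CreateObject("Scripting.Dictionary")
        mIndex.CompareMode = vbTextCompare      ' case-insensitive keys
    End If

    vWords = Split(sText, " ")
    For i = LBound(vWords) To UBound(vWords)
        sWord = Trim$(vWords(i))
        If Len(sWord) > 0 Then
            If Not mIndex.Exists(sWord) Then
                mIndex.Add sWord, sFileId
            ElseIf InStr(1, "|" & mIndex.Item(sWord) & "|", "|" & sFileId & "|") = 0 Then
                ' Append this file only if it is not already listed for the word
                mIndex.Item(sWord) = mIndex.Item(sWord) & "|" & sFileId
            End If
        End If
    Next i
End Sub

' Returns the "|"-separated IDs of files containing sWord, or "" if none.
Public Function FilesWithWord(ByVal sWord As String) As String
    If Not mIndex Is Nothing Then
        If mIndex.Exists(sWord) Then FilesWithWord = mIndex.Item(sWord)
    End If
End Function
```

For a phrase such as "a good boy", the index would only narrow the search to files that contain every word; those candidates would still need an InStr check to confirm the exact phrase. The index would also have to be rebuilt or updated when files change.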
shivanshu 08-20-2007, 02:26 PM Now here I am with an example of InStr on 10 files...
All the other thousands of files are similar to these...
These files are HTML files with simple HTML tags...
now the flaws in this example are:
-Slow on Large no. of files
-Case Sensitive ( don't need case sensitive Search)
-Show html tags in search
What I exactly wanted to know is whether this example can be changed into a faster search like the "Google" search engine...
I guess the case-sensitivity problem can be removed by UCase(file)... but that is again a compromise on speed...
I actually don't know how it excludes the HTML tags or helping verbs etc. without losing speed...
Flyguy 08-20-2007, 02:57 PM Google's databases and search engine are not a single computer doing a text-based search.
Serious:
http://www.google.com/technology/
Funny:
http://www.google.com/technology/pigeonrank.html
Using GoogleDesktop:
http://desktop.google.com/dev/index.html
mkaras 08-20-2007, 03:53 PM I looked at the example you uploaded above. I wanted to compare what you were doing against a Windows explorer search like I had suggested above.
I took the 10 sample files and copied them over and over into the .\Text subfolder until I had a total of 5650 files with names like "Copy (1) of 1.HTM" to "Copy (564) of 1.HTM" and so forth.
I then ran an Explorer search on these with the search string "Here we go" and it took 45 seconds to do the whole thing on my 2.65 GHz unit here with 1G of RAM.
To make an equivalent comparison in VB6 I had to re-do the code you sent so it would run and work. I then timed it in a full Instr() search on all the files and it took 7 seconds (give or take 1).
Seems to me that the VB search is not all that bad. :-)
Michael Karas
OnErr0r 08-20-2007, 04:45 PM "now the flaws in this example are:
-Slow on Large no. of files"
As the number of files increases, so does the length of time; this is to be expected when every file is searched. If the files to be searched don't change, or even if they change infrequently, you could always preprocess them and cache a list of words found in each file. Rockoon alluded to this previously.
"-Case Sensitive ( don't need case sensitive Search)"
This can also be accomplished using Instr with vbTextCompare, as opposed to vbBinaryCompare (the default). And, as you mentioned, this is a slower search.
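A small sketch of the vbTextCompare form. Note that InStr requires the start argument whenever the compare argument is supplied. The wrapper name ContainsNoCase is hypothetical.

```vb
' Case-insensitive "contains" check. ContainsNoCase is an illustrative name.
Public Function ContainsNoCase(ByRef sData As String, ByRef sFind As String) As Boolean
    ' The start argument (1) must be given when a compare argument is used
    ContainsNoCase = (InStr(1, sData, sFind, vbTextCompare) > 0)
End Function
```

An alternative sometimes tried is to convert both strings once with LCase$ and then use the default binary compare; which is quicker for a given set of files is best settled by timing both.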
"-Show html tags in search"
I explained this to you already and am still waiting for some explanation of what exactly you don't understand. It's basically a matter of performing the Instr you already have if and when two other instr's succeed. Consider separating this off to a function, for less confusion.
shivanshu 08-21-2007, 01:55 PM OK... now I'm able to apply a non-case-sensitive search... :)
Then I am able to exclude the HTML tags :D
I have attached the new code...
But the thing I'm worried about: it took 1.25 mins to search 3000 files
Flyguy 08-21-2007, 03:10 PM I think your RemoveHtmlTags function is quite slow, because it starts searching from the start of the file over and over again.
Try this one:
Private Function StripText(sData As String) As String
    Dim lPos1 As Long, lPos2 As Long
    Dim lPos As Long, lLen As Long
    Dim sText As String

    ' Create a buffer to store the stripped text
    ' Maximum size is the current data
    StripText = Space$(Len(sData))
    lPos = 1    ' Next write position in the buffer
    Do
        ' Text starts after a ">"
        lPos1 = InStr(lPos2 + 1, sData, ">")
        If lPos1 > 0 Then
            ' Text stops before a "<"
            lPos2 = InStr(lPos1 + 1, sData, "<")
            If lPos2 > 0 Then
                ' The text between the tags
                sText = Mid$(sData, lPos1 + 1, lPos2 - lPos1 - 1)
                lLen = Len(sText)
                ' Put the text in the buffer (skip empty runs between adjacent tags)
                If lLen > 0 Then
                    Mid$(StripText, lPos, lLen) = sText
                    lPos = lPos + lLen
                End If
            End If
        End If
    Loop Until lPos2 = 0 Or lPos1 = 0
    ' lPos is one past the last character written, so keep lPos - 1 characters
    StripText = Left$(StripText, lPos - 1)
End Function
Instead of
Call RemoveHtmlTags(s)
Use
s = StripText(s)
shivanshu 09-03-2007, 02:44 PM Thanks a lot for the better HTML tag removal function...
I'm replying late, but you gave me what I needed... mine was damn slow... but yours is cool... thanks again