waits77
12-26-2004, 06:43 AM
When I say fast, I'm not talking about execution speed alone. I'm talking about total man hours to get the user the desired results. It's false economy to spend hours trying to trim a tenth of a second off of the execution speed on code that only runs a hundred times a day.
Regular Expressions are generally very economical. They execute relatively quickly, they're quick to code and they're easy to modify. The downside is that they require a bit of learning. Some people even refer to Regular Expressions as a language unto itself.
I'm not going to go into the nuts and bolts of the "language". If this example of the .Execute method piques your interest, a detailed explanation of Regular Expressions can be found at MSDN:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnclinic/html/scripting051099.asp
Another great resource is an extensive library of Regular Expression solutions and an expression tester at:
http://www.regexlib.com/RETester.aspx
Regular Expressions allow us to match patterns of text. For instance, a Social Security number such as "015-47-1382" can be thought of as a pattern of 3 digits followed by a hyphen followed by 2 digits followed by a hyphen followed by 4 digits. Some possible .Patterns for it are "\d\d\d\-\d\d\-\d\d\d\d" and "\d{3}\-\d{2}\-\d{4}" and "[0-9]{3}\-[0-9]{2}\-[0-9]{4}" and so on. Once the .Pattern is defined, the .Execute method extracts text matching the pattern and places it in a collection of Match objects.
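As a minimal sketch of the idea above, the following routine runs one of the Social Security number patterns through .Execute and prints each match. The routine name and the input string are made up for illustration, and it assumes the project already has a reference to "Microsoft VBScript Regular Expressions 5.5".

```vb
' Sketch only: DemoSsnPattern and the sample text are hypothetical.
Public Sub DemoSsnPattern()
    Dim re As RegExp
    Dim colMatches As MatchCollection
    Dim m As Match

    Set re = New RegExp
    re.Pattern = "\d{3}\-\d{2}\-\d{4}"   ' 3 digits, hyphen, 2 digits, hyphen, 4 digits
    re.Global = True                      ' find every match, not just the first

    ' Made-up input containing two values that fit the pattern.
    Set colMatches = re.Execute("First: 015-47-1382, second: 123-45-6789.")

    For Each m In colMatches
        Debug.Print m.Value               ' the text that matched the pattern
    Next m
End Sub
```

Each Match object in the collection also exposes .FirstIndex and .Length if the position of the match matters.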
For this example, we're going to extract XYZ data from 3 very different file formats. The first, format1, looks like the following:
SET1=BEGINSET 00000029
MDI1 = MDI/55, 00000030
2.9314, 0.3636, 0.3731, 0.001146,-0.999999,-0.000168, 00000031
2.9549, 0.3351, 0.3785, 0.001146,-0.999999,-0.000168, 00000032
We want to extract the first three values on each data line. A likely .Pattern is 4 or more spaces followed by 1 or more numeric characters (digits, "." or "-") followed by "," and then the same again for the second and third values: " {4,}[\d\.\-]{1,}, {4,}[\d\.\-]{1,}, {4,}[\d\.\-]{1,}," which would match:
2.9314, 0.3636, 0.3731,
But what we really want is just the numbers. For that, we can use .SubMatches. Placing "()" in the pattern identifies submatches within the match: " {4,}([\d\.\-]{1,}), {4,}([\d\.\-]{1,}), {4,}([\d\.\-]{1,}),"
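To see the submatches in isolation, here is a small sketch that applies the pattern to one data line. The spacing of the sample line is an assumption (four spaces before each of the first three values), since the original column layout is not preserved here, and the routine name is hypothetical.

```vb
' Sketch only: DemoSubMatches is a hypothetical name and the spacing is assumed.
Public Sub DemoSubMatches()
    Dim re As RegExp
    Dim colMatches As MatchCollection
    Dim strLine As String

    ' Assumed layout: four spaces in front of each of the first three values.
    strLine = Space(4) & "2.9314," & Space(4) & "0.3636," & Space(4) & "0.3731," & _
              Space(4) & "0.001146,-0.999999,-0.000168,"

    Set re = New RegExp
    re.Pattern = " {4,}([\d\.\-]{1,}), {4,}([\d\.\-]{1,}), {4,}([\d\.\-]{1,}),"
    Set colMatches = re.Execute(strLine)

    If colMatches.Count > 0 Then
        With colMatches(0)
            Debug.Print .SubMatches(0)    ' first captured value
            Debug.Print .SubMatches(1)    ' second captured value
            Debug.Print .SubMatches(2)    ' third captured value
        End With
    End If
End Sub
```

The remaining values on the line are not captured because they are separated by "," alone, without the 4 or more spaces the pattern asks for.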
The second file format, format2:
~ Select `sellist0` `Namelist` \
1 `PNT431`
!%CIShowing POINT in feature 96 in model xxxxxxxxxxxxx.
!%CIDistance = 0.456320; dx= -0.456320, dy= 0, dz= 0.69034000.
!%CPSelect second object.
~ Activate `measure` `DistToSelect`
The .Pattern could be "CIDist" followed by 1 or more characters that are not new line characters, followed by 1 or more spaces, followed by "dx=", followed by 1 or more spaces, followed by 1 or more numeric characters ... "CIDist[^\n]+ +dx= +([\d\.\-]+), +dy= +([\d\.\-]+), +dz= +([\d\.\-]+)\." This one is a little more complex. First, the "+" is the same as "{1,}". Second, the "^" inside of "[]" means "not", so "[^\n]" means any character that is not a new line character, and "[^\n]+" means 1 or more of any character so long as they are all on the same line. Note that "\n" matches only the line feed (vbLf), not the vbCr of a Windows vbCrLf line ending. This pattern never has to match a line ending, so it works on Windows files as well as Unix files; a pattern that expects "\n" immediately after a value would need "\r?\n" to handle both.
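A quick sketch of this pattern applied to the "CIDistance" line from the sample follows. The routine name is hypothetical, and the single spaces in the test string are an assumption about the real file's spacing.

```vb
' Sketch only: DemoFormat2 is a hypothetical name; spacing in strLine is assumed.
Public Sub DemoFormat2()
    Dim re As RegExp
    Dim colMatches As MatchCollection
    Dim strLine As String

    strLine = "!%CIDistance = 0.456320; dx= -0.456320, dy= 0, dz= 0.69034000."

    Set re = New RegExp
    re.Pattern = "CIDist[^\n]+ +dx= +([\d\.\-]+), +dy= +([\d\.\-]+), +dz= +([\d\.\-]+)\."
    Set colMatches = re.Execute(strLine)

    If colMatches.Count > 0 Then
        With colMatches(0)
            Debug.Print .SubMatches(0)    ' the dx value
            Debug.Print .SubMatches(1)    ' the dy value
            Debug.Print .SubMatches(2)    ' the dz value, without the closing "."
        End With
    End If
End Sub
```

Because "." is part of the character class, the last group first swallows the closing period and then gives it back so that the final "\." can match.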
The third file format, format3:
154 SP PNT X -0.00019
Y -0.35387
Z -0.02956
D -0.00019
155 154* SP PNT X -0.00019
Y -0.35387
Z -0.02956
D -0.00019
S .00000 FORM .00000
** RESULT V10SA D
8 DISPLAC X 0.00000
Y 0.00000
Z 1.00000
9 4! SP PNT X 0.65289
Y -0.35387
Z -0.48834
D 0.00789
S .00000 FORM .00000
10 5! SP PNT X 0.65315
Y -0.14553
Z -0.48834
D 0.00815
Here, the thing that sets the data apart is the "RESULT V10SA" that follows the XYZ values we want. In this file, the "V10SA" can be any sequence of 5 digits and upper case letters, so our .Pattern will end in "RESULT [A-Z\d]{5}". So, how about " +X +([\d\.\-]+)\n +Y +([\d\.\-]+)\n +Z +([\d\.\-]+)\n(?:[^\n]{0,}\n){4}[^\n]+ RESULT [A-Z\d]{5}"? This is a little more complex again. In addition to delineating submatches, "()" also groups characters, in this case so that we can repeat a group 4 times with "{4}". To prevent the engine from treating this group as a submatch, we use "?:" following the "(". What the .Pattern says is:
1 or more spaces
followed by "X"
followed by 1 or more spaces
followed by (1 or more numeric characters) as a submatch
followed by new line
followed by 1 or more spaces
followed by "Y"
followed by 1 or more spaces
followed by (1 or more numeric characters) as a submatch
followed by new line
followed by 1 or more spaces
followed by "Z"
followed by 1 or more spaces
followed by (1 or more numeric characters) as a submatch
followed by new line
followed by (0 or more of any character that's not new line
followed by new line) 4 of these => 4 lines of any content
followed by 1 or more of any character that's not new line
followed by "RESULT "
followed by 5 upper case letters and/or digits
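The breakdown above can be tried out with a sketch like the following. The test block is hypothetical: its indentation, the vbLf line endings, and the four lines between the Z line and the RESULT line (two of them blank) are assumptions chosen to satisfy the pattern, not a copy of a real file.

```vb
' Sketch only: DemoFormat3 is a hypothetical name and strBlock is made-up test data.
Public Sub DemoFormat3()
    Dim re As RegExp
    Dim colMatches As MatchCollection
    Dim strBlock As String

    ' Z line, then exactly four lines of any content, then the RESULT line.
    strBlock = "155 154* SP PNT X -0.00019" & vbLf & _
               "                Y -0.35387" & vbLf & _
               "                Z -0.02956" & vbLf & _
               "                D -0.00019" & vbLf & _
               "                S .00000 FORM .00000" & vbLf & _
               vbLf & _
               vbLf & _
               "** RESULT V10SA D" & vbLf

    Set re = New RegExp
    re.Pattern = " +X +([\d\.\-]+)\n +Y +([\d\.\-]+)\n +Z +([\d\.\-]+)\n" & _
                 "(?:[^\n]{0,}\n){4}[^\n]+ RESULT [A-Z\d]{5}"
    Set colMatches = re.Execute(strBlock)

    If colMatches.Count > 0 Then
        With colMatches(0)
            Debug.Print .SubMatches.Count ' 3, the "(?:" group is not counted
            Debug.Print .SubMatches(0)    ' X value
            Debug.Print .SubMatches(1)    ' Y value
            Debug.Print .SubMatches(2)    ' Z value
        End With
    End If
End Sub
```

Changing the number of filler lines in the test block is an easy way to see that "{4}" really does demand exactly four lines.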
Our 3 .Patterns are:
" {4,}([\d\.\-]{1,}), {4,}([\d\.\-]{1,}), {4,}([\d\.\-]{1,}),"
"CIDist[^\n]+ +dx= +([\d\.\-]+)[,\.] +dy= +([\d\.\-]+)[,\.] +dz= +([\d\.\-]+)[,\.]"
" +X +([\d\.\-]+)\n +Y +([\d\.\-]+)\n +Z +([\d\.\-]+)\n(?:[^\n]{0,}\n){4}[^\n]+ RESULT [A-Z\d]{5}"
The hard part is done. Now all we have to do is read a file, see which pattern is present (.Test) and then extract the data (.Execute).
The following code requires the "Microsoft Common Dialog Control 6.0" (COMDLG32.OCX) and a reference to "Microsoft VBScript Regular Expressions 5.5" (VBSCRIPT.DLL).
Option Explicit

Const FORMAT_1 = " {4,}([\d\.\-]{1,}), {4,}([\d\.\-]{1,}), {4,}([\d\.\-]{1,}),"
Const FORMAT_2 = "CIDist[^\n]+ +dx= +([\d\.\-]+)[,\.] +dy= +([\d\.\-]+)[,\.] +dz= +([\d\.\-]+)[,\.]"
Const FORMAT_3 = " +X +([\d\.\-]+)\n +Y +([\d\.\-]+)\n +Z +([\d\.\-]+)\n(?:[^\n]{0,}\n){4}[^\n]+ RESULT [A-Z\d]{5}"

Private Type dblPoint
    X As Double
    Y As Double
    Z As Double
End Type

Dim Points() As dblPoint
Dim regXYZ As RegExp

Private Sub Command1_Click()
    Dim intFNum As Integer
    Dim intFlag As Integer
    Dim i As Long
    Dim strFName As String
    Dim strFile As String
    Dim varXYZ As Variant

    CommonDialog1.FileName = ""             ' clear any earlier selection
    CommonDialog1.ShowOpen                  ' get the filename
    strFName = CommonDialog1.FileName
    If Len(strFName) = 0 Then Exit Sub      ' no file chosen, nothing to do

    intFNum = FreeFile                      ' next available file number
    Open strFName For Binary As #intFNum
    strFile = Space(LOF(intFNum))           ' size the string to hold the file
    Get #intFNum, , strFile                 ' read the file
    Close #intFNum

    Set regXYZ = New RegExp                 ' establish the expression
    With regXYZ
        .Pattern = FORMAT_3                 ' set the .Pattern
        If .Test(strFile) Then              ' test to see if the pattern is present
            ' got a match so FORMAT_3 is the pattern
            intFlag = 1
        Else
            .Pattern = FORMAT_2
            If .Test(strFile) Then
                ' got a match so FORMAT_2 is the pattern
                intFlag = 1
            Else
                .Pattern = FORMAT_1
                If .Test(strFile) Then
                    ' got a match so FORMAT_1 is the pattern
                    intFlag = 1
                Else
                    MsgBox "Did not match pattern."
                    intFlag = 0             ' not needed, but reads better
                End If
            End If
        End If
        .Global = True                      ' defaults to looking for first match only
    End With

    If intFlag Then                         ' if we got a pattern match
        Set varXYZ = regXYZ.Execute(strFile)    ' extract the data
        Set regXYZ = Nothing                ' reclaim the object
        ReDim Points(varXYZ.Count - 1)      ' size the array to hold the data
        For i = 0 To varXYZ.Count - 1       ' loop through and assign array values
            ' Debug.Print varXYZ(i)
            With varXYZ(i)
                ' Debug.Print .SubMatches(0)
                ' Debug.Print .SubMatches(1)
                ' Debug.Print .SubMatches(2)
                Points(i).X = CDbl(.SubMatches(0))
                Points(i).Y = CDbl(.SubMatches(1))
                Points(i).Z = CDbl(.SubMatches(2))
            End With
        Next i
        MsgBox "Done with " & CStr(varXYZ.Count) & " matches."
        Set varXYZ = Nothing                ' reclaim the object
    End If
End Sub
Here we have three very different file formats all being parsed using almost identical code. The only thing that changes is the .Pattern. The "language" of Regular Expressions is worth the effort to learn.
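Since only the .Pattern changes, the pattern selection can also be written as a loop over a list of patterns. The helper below is a sketch of that idea and is not part of the code above; FindPattern and its parameter names are hypothetical.

```vb
' Sketch only: FindPattern is a hypothetical helper.
' Returns the first pattern in varPatterns that is present in strText,
' or an empty string if none of them matches.
Public Function FindPattern(ByVal strText As String, varPatterns As Variant) As String
    Dim re As RegExp
    Dim i As Long

    Set re = New RegExp
    For i = LBound(varPatterns) To UBound(varPatterns)
        re.Pattern = varPatterns(i)
        If re.Test(strText) Then
            FindPattern = varPatterns(i)
            Exit Function
        End If
    Next i
    FindPattern = ""                      ' no pattern was present
End Function

' Possible use, trying the patterns in the same order as the code above:
'     strPattern = FindPattern(strFile, Array(FORMAT_3, FORMAT_2, FORMAT_1))
'     If Len(strPattern) = 0 Then MsgBox "Did not match pattern."
```

With this arrangement, supporting a fourth file format would mean adding one more constant to the Array call rather than another level of nested If blocks.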