Extract specific data from strings (text files)

#1  3dkingpin (Junior Contributor), 05-25-2008, 04:43 PM

This is my first real solo coding effort in VB. What I'm trying to do is extract emails from a text file. The file contains emails amongst other things, but I want to extract the emails only and save them to another text file.

What I've done is below. First I find the position of '@', then work backwards until I find a space etc., then work forward from the position of '@' until another non-email character is found. Then I remove that part of the string and start again. I have problems with the string formatting. For example

Is the code correct, or can you suggest a better way?


Code:
    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        Dim _string As String = "rafhelp@yahoo.co.uk;bibi@yoyo.com   ;john@gmail.com  fart@joejoe.com"
        Dim str As String
        Dim atpos, spos, epos, _len As Double
        Dim i As Integer
        txt_emails.Clear()

        If txt_filename.Text = "" Then
            MsgBox("Error Please select an input file first")
            Exit Sub
        End If

        If file_exists(txt_filename.Text) = False Then
            MsgBox("Input file does not exist please select another!")
            Exit Sub
        End If

        Dim _string As String = make_string(txt_filename.Text)

        If _string = "" Or InStr(1, _string, "@", CompareMethod.Binary) = 0 Then
            MsgBox("Input file does not contain any valid emails")
            Exit Sub
        End If

        Do Until InStr(1, _string, "@", CompareMethod.Binary) = 0
            _len = Len(_string)

            atpos = InStr(1, _string, "@", CompareMethod.Binary)
            If atpos = 0 Then Exit Do

            spos = 0

            For i = atpos To 0 Step -1
                If _string.Chars(i) = " " Or _
                   _string.Chars(i) = "," Or _
                   _string.Chars(i) = "," Then
                    spos = (i + 1)
                    Exit For
                End If
            Next

            For i = atpos To (_len - 1) Step 1
                If _string.Chars(i) = " " Or _
                   _string.Chars(i) = "," Or _
                   _string.Chars(i) = ";" Then
                    epos = i
                    Exit For
                End If
            Next

            str = Mid(_string, spos + 1, (epos - spos))
            str = Trim(str)
            'MsgBox(str)
            txt_emails.AppendText(str + vbCrLf)
            If _len - epos > 0 Then
                _string = Microsoft.VisualBasic.Right(_string, _len - epos)
            Else
                _string = ""
            End If
            'MsgBox("Left With:" + Chr(10) + _string)
        Loop
    End Sub

Last edited by 3dkingpin; 05-25-2008 at 04:48 PM.

#2  jo0ls (Senior Contributor, Forum Leader), 05-27-2008, 08:58 AM

Regular expressions are ideal for this sort of thing. A regular expression is a String that can represent various other Strings. For example, the regular expression "[A-Z]" can stand for any single character from A to Z. Creating a suitable regular expression can be difficult, but ones that will match any valid e-mail address are easy to find with a web search. You can learn how they work by reading through a regular expressions tutorial. Once you have a suitable regular expression, the steps would be to:

1) Load the file into a String
2) Create a new Regex object using the pattern that matches emails.
3) Create a MatchCollection object, and assign to it the results you get from calling the Matches method of the Regex object you created in (2)
4) Loop through the collection to read the results.

See the msdn topic on MatchCollection. The advantage is that it should be about 5-6 lines of code for the function.
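
The four steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the pattern is an assumption (a deliberately simple one that will miss some valid addresses), and the names RegexEmailSketch, EmailPattern and FindEmailsWithRegex are made up for the example.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RegexEmailSketch

    ' Assumption: a deliberately simple pattern, not a complete RFC 2822 matcher.
    Private Const EmailPattern As String = "[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+"

    Public Function FindEmailsWithRegex(ByVal source As String) As List(Of String)
        Dim results As New List(Of String)()

        ' Steps 2 and 3: create the Regex and collect its matches.
        Dim emailRegex As New Regex(EmailPattern)
        Dim matches As MatchCollection = emailRegex.Matches(source)

        ' Step 4: read each match out of the collection.
        For Each m As Match In matches
            results.Add(m.Value)
        Next

        Return results
    End Function

    Sub Main()
        ' Step 1 would normally be System.IO.File.ReadAllText(path);
        ' the sample string from the first post stands in for the file here.
        Dim data As String = "rafhelp@yahoo.co.uk;bibi@yoyo.com   ;john@gmail.com  fart@joejoe.com"

        For Each email As String In FindEmailsWithRegex(data)
            Console.WriteLine(email)
        Next
    End Sub

End Module
```

On the sample string this prints the four addresses, one per line, without the stray ";" characters, because ";" is not part of the pattern.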

Notes on your code.

1) It's always good to code it yourself!
2) You've chosen Double for some of the variables, but they all store integer (whole number) values. Double is best for floating point numbers. They would be better as Integer.
3) You've declared _string twice, so presumably you've pasted bits from your program!

The output I got from a test program was:

Code:
rafhelp@yahoo.co.uk
;bibi@yoyo.com
;john@gmail.com
fart@joejoe.com
So you have got some extra characters in there. It looks like it is just a typo:

    For i = atpos To 0 Step -1
        If _string.Chars(i) = " " Or _
           _string.Chars(i) = "," Or _
           _string.Chars(i) = "," Then
            spos = (i + 1)
            Exit For
        End If
    Next

The second "," should be a ";".

If the file containing the emails is large, then your technique will unfortunately choke. The problem is that each time it finds an email, it builds a new copy of the rest of the string and starts again from scratch on that copy.

It would be more efficient to have an algorithm that:

1) Loops through every character in the string:
For index As Integer = 0 to String.Length - 1
2) If the character at index is a @ then
a) Move backwards from index to find the start of the email
b) Move forwards from index to find the end of the email. Store the location of the end of the email in end.
c) Extract the email with SubString. Make a note of the email.
d) Set index to end
3) Next index

This way it doesn't backtrack to the start of the file for each email it finds.
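
A rough sketch of that single-pass algorithm is below. It is only an illustration under stated assumptions: the delimiter set (space, comma, semicolon) is taken from the original code, and the names SinglePassSketch, Delimiters, IsDelimiter and FindEmailsSinglePass are hypothetical. A While loop is used instead of For so that moving the index forward in step (d) is explicit.

```vbnet
Imports System
Imports System.Collections.Generic

Module SinglePassSketch

    ' Assumption: the same delimiters the original code looks for.
    Private ReadOnly Delimiters() As Char = {" "c, ","c, ";"c}

    Private Function IsDelimiter(ByVal c As Char) As Boolean
        Return Array.IndexOf(Delimiters, c) >= 0
    End Function

    Public Function FindEmailsSinglePass(ByVal source As String) As List(Of String)
        Dim emails As New List(Of String)()
        Dim index As Integer = 0

        While index < source.Length
            If source(index) = "@"c Then
                ' a) Walk backwards until the previous character is a delimiter.
                Dim startPos As Integer = index
                While startPos > 0 AndAlso Not IsDelimiter(source(startPos - 1))
                    startPos -= 1
                End While

                ' b) Walk forwards to the next delimiter or the end of the string.
                Dim endPos As Integer = index
                While endPos < source.Length AndAlso Not IsDelimiter(source(endPos))
                    endPos += 1
                End While

                ' c) Extract the address; endPos is exclusive.
                emails.Add(source.Substring(startPos, endPos - startPos))

                ' d) Continue scanning after the address just found.
                index = endPos
            Else
                index += 1
            End If
        End While

        Return emails
    End Function

    Sub Main()
        Dim data As String = "rafhelp@yahoo.co.uk;bibi@yoyo.com   ;john@gmail.com  fart@joejoe.com"

        For Each email As String In FindEmailsSinglePass(data)
            Console.WriteLine("|{0}|", email)
        Next
    End Sub

End Module
```

Each character is visited a bounded number of times and the input string is never copied, so the cost grows roughly in proportion to the file size rather than with the number of emails times the file size.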

#3  AtmaWeapon (Fabulous Florist, Forum Leader), 05-27-2008, 12:13 PM

Using regular expressions for finding emails might be painful if you need to cover all of the cases. This page explains why. Take a gander at the full expression for valid email addresses; it is several lines of code long, and actually doesn't cover 100% of the allowable structures for an email address. You can use a simpler regular expression, but then you run the risk of missing some email addresses. In the end, there's no fool-proof way to catch 100% of valid emails because the standard left itself a little bit too open.

Honestly, shunning regular expressions and writing a parser for the file format you specified isn't really that tough; the hard part is defining what you mean by "non e-mail" characters. If you consult RFC 2822 Internet Message Format, the document that defines the standard for email addresses, you see that emails can contain the symbols !#$%&'*+-/=?^_`{|}~ in addition to any letter or digit, so they don't make good terminals. Technically, section 3.2.5 seems to allow even abnormal characters via quoted strings; I've never seen this used in practice but there you go.

I believe your parser made things a bit too complicated by focusing on the @ symbol in the address. The example string you have given doesn't really require using the @ symbol as a sentinel value; all email addresses are bounded by terminals. What follows is a pretty naive implementation of a parser that does what you asked for. It uses ; and common whitespace characters as terminals. It remembers characters until it finds a terminal, then decides if the characters it is remembering represent an email address by simply checking for an @ symbol. This check seems superfluous but accidentally serves a purpose: if multiple terminals follow each other (as in the part of your string: "....com ;john..."), then the accumulator has zero length and isn't a valid email address. The check for the @ symbol keeps these empty strings out of the list.

Code:
Imports System
Imports System.Collections.Generic
Imports StringBuilder = System.Text.StringBuilder

Module Module1

    Private ReadOnly Terminals() As Char = {";"c, " "c, CChar(vbTab), CChar(vbCr), CChar(vbLf)}

    Sub Main()
        Dim data As String = "rafhelp@yahoo.co.uk;bibi@yoyo.com   ;john@gmail.com  fart@joejoe.com"
        Dim emails() As String = FindEmails(data)

        For Each email As String In emails
            Console.WriteLine("|{0}|", email)
        Next
    End Sub

    Private Function FindEmails(ByVal input As String) As String()
        Dim emails As New List(Of String)()

        Dim accumulator As New StringBuilder()

        For currentIndex As Integer = 0 To input.Length - 1
            If Array.IndexOf(Terminals, input(currentIndex)) < 0 Then
                accumulator.Append(input(currentIndex))
            Else
                Dim potentialEmail As String = accumulator.ToString().Trim()

                ' Check for an @ symbol
                If potentialEmail.IndexOf("@") >= 0 Then
                    emails.Add(potentialEmail)
                End If

                accumulator.Remove(0, accumulator.Length)
            End If
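        Next

        ' Flush the accumulator: the input may not end with a terminal,
        ' in which case the last address is still waiting to be checked.
        Dim lastCandidate As String = accumulator.ToString().Trim()
        If lastCandidate.IndexOf("@") >= 0 Then
            emails.Add(lastCandidate)
        End If

        Return emails.ToArray()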
    End Function

End Module
If your data set is more complicated, then it's probably worth making a more complicated parser, but it's hard to say whether regular expressions will be a boon or a curse without more information about the file structure.
__________________
.NET Resources
My FAQ threads | Tutor's Corner | Code Library
I would bet money 2/3 of .NET questions are already answered in one of these three places.

#4  3dkingpin (Junior Contributor), 05-27-2008, 06:05 PM

hmmmm!?

I got stuck. Then I went back to my old programming language and made a complete email extractor within about a day. It extracts emails from any text, CSV or HTML file. It can add delimiters to email lists and separate emails into blocks of a specified number. You can also delete single emails from an email list, and delete emails specified in a separate list from the email list. It works fine. Thanks for pointing out the non-email symbols. I'd better remove the { and } from my code, as I was not aware you could use these in email addresses.

Anyway, I might post the exe here if anyone wants to see it.

I like the idea of creating a regular expression and then using
For Each email As String In emails
to simply extract each email from the entire string.

That seems straightforward in theory; I might try it if I get time.


Here's one I made, not with VB though. I used a similar parser, in that I looked for @ symbols, then worked back until a non-email character was found, then worked forward until a non-email character was found. It even covers the ".com.." problem caused by having two full stops together. Yes, it may not look as good as a VB app, but it does the job, which is more important. There are a few bugs, so I don't recommend anyone use it for anything other than testing purposes. Try extracting emails from a CSV or HTML file. It is quite fast, I'd say.

Edit by moderator: Link removed due to .exe in .zip file

Last edited by webbone; 05-27-2008 at 10:35 PM.

#5  AtmaWeapon (Fabulous Florist, Forum Leader), 05-27-2008, 08:55 PM

If it's a zip with an executable, please remove the link. The Posting Guidelines point out that it's not allowed, and it's kind of against the spirit of the forum.

If the program is interesting, post the source and instructions on how to build it (personally I'm curious about this old language of yours).

#6  3dkingpin (Junior Contributor), 05-28-2008, 05:46 AM

Fine, if those are the rules then those are the rules. But really, I see no harm in posting applications you have created. I know the reasons, but the thing is, it's up to the user to download and install, and you would not download it if you didn't trust the author. Most people have virus checkers or spyware removal tools installed anyway.

I mean, 'spirit of the forum'? I thought this was a programming site all about applications, source, exes, the whole thing. The other thing is, if someone posted a link to an exe just randomly for no reason, then yes, you would question it, but this was posted in response to the theme of the thread, with the purpose of showing that the 'naive' implementation of a parser does actually work quite well when fixed up.

I don't want to debate this! And OK, no more exes. But what about .msi files?
__________________
Hey, Looking for entry level job in VB net programming. Got basic qualification, years of programming experience (mainly self taught) and good portfolio. (UK)

#7  AtmaWeapon (Fabulous Florist, Forum Leader), 05-28-2008, 09:39 AM

This rule is a good thing. Let's discuss some reasons why.

Authentication is not identification. Whether or not I believe I can trust you, the forums provide little guarantee that the person who created a post is who they say they are; your login is authentication, not identification. If someone manages to obtain your username and password, they can post malicious executables in your name and abuse the misplaced trust of forum users. The only guarantee I have is that whoever made your post has your username and password.

Executables teach nothing. This is a programming forum. It's about teaching people how to solve their problems using Visual Basic .NET. My primary goal is not to solve anyone's problem, but to teach them how they can solve it themselves the next time they're faced with it. I would hope it's the goal of the rest of the site's members as well. A compiled executable teaches nobody anything about implementing a solution to a problem; it simply shows a solution exists. Would you like it if I posted solutions to your problems in MSIL rather than VB .NET? It's still easily convertible into an executable, yet it provides you with no utility, does it? An executable is analogous: a black-box problem solver that doesn't help the user understand why it works.

Virus scanners are junk. Hundreds of new viruses are released every hour. Your virus scanner is likely updated once a day, if that often. Operating under the assumption, "I have a virus scanner therefore I'm safe" is analogous to saying, "I'm wearing my seatbelt, I will survive anything." Virus scanners provide a measure of protection from when a person who practices responsible computing makes a mistake. Opening a random executable from a forum that provides authentication rather than identification is a mistake. I believe the same thing Jeff Atwood believes about antivirus software: their protection relies on how complete the blacklist coverage is, and the race involves a few thousand security researchers vs. at least half a million virus writers. Virus scanners are an important part of a flawed security scheme.

Transparency. Even if your program's not a virus, it can accidentally do bad things. Your program might put some data files in a directory I don't want it to, access files I don't want it to, install hooks to attempt to log my keystrokes, spawn a lot of itself to consume too many resources, contain errors that will corrupt my data files, etc. Since it's in executable form, I have no idea if it does any of these things. If it were distributed in source form, I can look at the source before running it and verify that all is well. This may sound overly paranoid, but sometimes programmers with good intentions do bad things. For example, a third-party control used in another thread installs a low-level keyboard hook in order to handle keyboard input. Keyboard hooks can adversely affect system performance and present a security risk, since that program will intercept all keystrokes before the focused window even sees them. The only way I knew the hook was there is because the source was distributed rather than the DLL. This means I not only noticed the hook, but I had the ability to check the code to make sure it wasn't some form of keylogger. Had it been distributed as just a DLL, I'd have no idea.

#8  3dkingpin (Junior Contributor), 05-28-2008, 10:41 AM

Like I said, I know the reasons why, and I did state I will not post exe links.

This is the only programming forum I have been on which does not allow posting of exes or compiled code.

Anyway, back to the thread again... I changed the parser:

Old parser:
It looked for the "@" symbol, then worked back one char at a time. If the char was not a valid letter or digit, it then checked whether it was a non-email character, to get the start of the address. It did the same the other way to find the end of the email.

New parser:
I changed the old one, which caused a few bugs. From what I read above, I realized that there are about 19 allowed email symbols, and it would be easier and more valid to check for these rather than for non-email symbols, as there could be plenty of those (taking more time).

So now:
It looks for the "@" symbol, then works back one char at a time. If the char is not a valid letter or digit, it checks whether it is a valid email symbol; if it is not, it stops and remembers the location (as the start or end of the address). There's a special check when a "." is found, which removes "...." etc. from the start or end of emails.
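
In VB .NET, the whitelist approach described above might look roughly like the sketch below. It is not the poster's actual code, which was written in another language and never shown. The symbol list is the set of 19 symbols RFC 2822 allows in an unquoted local part, plus "." as the separator; applying the same whitelist on both sides of the "@" and trimming dots with Trim are assumptions about how the description maps to code, and all names are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic

Module WhitelistParserSketch

    ' The 19 symbols RFC 2822 allows in an unquoted local part, plus "."
    ' which separates the parts of an address.
    Private Const AllowedSymbols As String = "!#$%&'*+-/=?^_`{|}~."

    Private Function IsEmailChar(ByVal c As Char) As Boolean
        Return Char.IsLetterOrDigit(c) OrElse AllowedSymbols.IndexOf(c) >= 0
    End Function

    Public Function FindEmailsByWhitelist(ByVal source As String) As List(Of String)
        Dim emails As New List(Of String)()
        Dim index As Integer = 0

        While index < source.Length
            If source(index) = "@"c Then
                ' Work back while the previous character is allowed.
                Dim startPos As Integer = index
                While startPos > 0 AndAlso IsEmailChar(source(startPos - 1))
                    startPos -= 1
                End While

                ' Work forward from the character after the @.
                Dim endPos As Integer = index + 1
                While endPos < source.Length AndAlso IsEmailChar(source(endPos))
                    endPos += 1
                End While

                ' The special "." check: strip leading and trailing dots.
                Dim candidate As String = source.Substring(startPos, endPos - startPos).Trim("."c)

                ' Keep only candidates with text on both sides of the @.
                Dim atPos As Integer = candidate.IndexOf("@"c)
                If atPos > 0 AndAlso atPos < candidate.Length - 1 Then
                    emails.Add(candidate)
                End If

                index = endPos
            Else
                index += 1
            End If
        End While

        Return emails
    End Function

    Sub Main()
        Dim data As String = "Mail fred@example.com... or <jane@example.org>; thanks"

        For Each email As String In FindEmailsByWhitelist(data)
            Console.WriteLine(email)
        Next
    End Sub

End Module
```

Because anything outside the whitelist ends an address, characters such as ";", "<", ">" and whitespace act as terminals without having to be listed. Note that Char.IsLetterOrDigit also accepts non-ASCII letters, which is more permissive than the RFC.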