  #1  
Old 08-18-2007, 01:25 PM
JaredHess JaredHess is offline
Regular
 
Join Date: Jul 2003
Posts: 81
Replacing Strings Across Multiple Lines

I'm working on a simple custom in-house application that reads through any .htm files in a chosen directory and its subdirectories and removes any size attributes from image tags, so that each image displays at its native size instead of being constrained to a specific size by the HTML code.

I've built the application to the point where I'm able to loop through thousands of .htm files and read each .htm file's contents into a string variable. There are usually around 3000 .htm files in the target folder.

I am not terribly familiar with regular expressions, but I've done some with Perl before, so I understand the concept even if it does pull my brain in uncomfortable directions. However, I've looked through the help files Microsoft offers and am confused about how to accomplish this in VB.NET.

Here's an example of an image tag that needs to have its sizing elements removed.

Code:
<p class=Caption
	style="margin-left: 40px;">&nbsp;<img src="5_file_adv_files/dialog_iges_file.gif"
											alt="Finestra di dialogo File IGES"
											x-maintain-ratio=TRUE
											style="border: none;
													width: 262px;
													height: 201px;
													float: none;
													border-style: none;"
											width=262
											height=201
											border=0><br> Finestra di dialogo File IGES</p>
I want to completely remove these items:

width: 262px;
height: 201px;


and these items:

width=262
height=201


... but keep everything else. Now imagine thousands of these code blocks where the intervening text, whitespace, and carriage return placement can differ from .htm to .htm. (I actually tried to find a way to do this with an InStr / Substring / Replace approach, but that ended up WAY too messy and grabbed incorrect text anyway.) So, I'm guessing regular expressions are the way to go.

Any ideas on how I could do this with regular expressions?
  #2  
Old 08-18-2007, 03:10 PM
Deadalus Deadalus is offline
Promising Talent

Retired Moderator
* Guru *
 
Join Date: May 2002
Location: Brussels
Posts: 3,600
Default

You're right, it's a job for regular expressions. If you're relatively new to them or their use in .NET, you'll have to read up: http://msdn2.microsoft.com/en-us/lib...12(vs.80).aspx.

I think your particular challenge shouldn't be too hard if you use backreferences and treat the four things to find (the width and height attributes, and width and height in the style attribute) separately. The following code manages to remove width and height as attributes with your example as input (the string original). The rest, including fine-tuning these to work for all your files (e.g. think of extra whitespace), is up to you.
Code:
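' Requires: Imports System.Text.RegularExpressions
Dim modified As String
' The named group "tokeep" captures everything from "<img" up to the attribute;
' the rest of the pattern matches the attribute itself plus trailing whitespace.
Dim findWidth As String = "(?<tokeep><img[^>]*)(width)=\d+(em|px)?\s*"
Dim findHeight As String = "(?<tokeep><img[^>]*)(height)=\d+(em|px)?\s*"
' Replace each match with only the captured "tokeep" part, dropping the attribute.
modified = Regex.Replace(original, findWidth, "${tokeep}")
modified = Regex.Replace(modified, findHeight, "${tokeep}")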
Edit: Having thought about it some more, I would simplify the expressions and make use of the rules of HTML. For the attributes, we can remove anything from 'width'/'height' to the next (series of) whitespace. No need to specify the digits (\d+) and trailing indications like em and px. Similarly, in the style attribute you can remove anything from 'width' up to the next semicolon, plus trailing whitespace.
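A minimal sketch of the simplified approach described in the edit above. The names (ImageSizeStripper, StripImageSizes) and the exact patterns are illustrative assumptions, not code from this thread. It isolates each img tag first, then removes width/height declarations and attributes inside that tag only, so nothing outside image tags is touched. It would also remove text such as "width: 5;" inside an alt value, so treat it as a starting point rather than a finished tool.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ImageSizeStripper
    ' A whole <img ...> tag; [^>]* also spans line breaks.
    Private Const ImgTagPattern As String = "<img\b[^>]*>"

    ' width/height as HTML attributes: leading whitespace, the name, "=",
    ' then a double-quoted, single-quoted, or unquoted value.
    Private Const AttributePattern As String = "\s+(?:width|height)\s*=\s*(?:""[^""]*""|'[^']*'|[^\s>]+)"

    ' width/height as style declarations: from the name up to the next semicolon,
    ' plus trailing whitespace. The lookbehind skips names such as max-width.
    Private Const DeclarationPattern As String = "(?<![-\w])(?:width|height)\s*:[^;""']*;?\s*"

    ' Returns html with size information removed from every <img> tag.
    Public Function StripImageSizes(ByVal html As String) As String
        Return Regex.Replace(html, ImgTagPattern, _
            New MatchEvaluator(AddressOf StripOneTag), RegexOptions.IgnoreCase)
    End Function

    ' Called once per matched <img> tag; returns the cleaned tag text.
    Private Function StripOneTag(ByVal m As Match) As String
        Dim tag As String = Regex.Replace(m.Value, DeclarationPattern, "", RegexOptions.IgnoreCase)
        Return Regex.Replace(tag, AttributePattern, "", RegexOptions.IgnoreCase)
    End Function
End Module
```

Calling ImageSizeStripper.StripImageSizes(original) on the text of one file returns the modified text; a width attribute on a non-image element such as a table cell is left alone.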
__________________
I do not endorse any advertisements that appear in my contribution and detest their placement against my will.

Last edited by Deadalus; 08-18-2007 at 04:38 PM.
  #3  
Old 08-31-2007, 02:51 PM
JaredHess JaredHess is offline
Regular
 
Join Date: Jul 2003
Posts: 81
Question

Deadalus, thanks for the response. I appreciate it. Using your example I was able to come up with a way of replacing these items in all my .htms and rewriting the information back into each .htm by using this code:

Code:
Sub ProcessFile(ByVal strCurrFile As String, ByVal i As Integer)
    ' opens file for read and writing.
    Dim strCurrFileText As String
    ' puts all file text inside one variable
    strCurrFileText = My.Computer.FileSystem.ReadAllText(strCurrFile)

    ' Use these regex for normal image size attributes
    '(?<tokeep><img[^>]*)(width=[^\s]*)
    ' I need to look for this type too.
    '<p class=Figure><span class=InlineFigures><img src="14_auto_files/Af_Corner_Point.gif" usemap="#Map" style="width:418px; height:368px;" width= "418" height=" 368" border="0" class="hcp1"> </span></p>
    ' (?<tokeep><img[^>]*)(width)[^=]*=[^\d]*\d+[^\s]*
    Dim strModified As String
    Dim strFindWidth As String = "(?<tokeep><img[^>]*)(width[^=]*=[^\d]*\d+[^\s]*)"
    Dim strFindHeight As String = "(?<tokeep><img[^>]*)(height[^=]*=[^\d]*\d+[^\s]*)"

    ' Use these regex statements for pixel style size attributes
    '(?<tokeep><img[^>]*)(width:[^;]*;)
    '(?<tokeep><img[^>]*)(height:[^;]*;)
    Dim strFindWidth2 As String = "(?<tokeep><img[^>]*)(width:[^;]*;)"
    Dim strFindHeight2 As String = "(?<tokeep><img[^>]*)(height:[^;]*;)"

    ' Count the matches
    Dim MyMatches As MatchCollection
    Dim cnt1, cnt2, cnt3, cnt4 As Integer
    cnt1 = 0
    cnt2 = 0
    cnt3 = 0
    cnt4 = 0

    ' Counts up number of changes in project
    MyMatches = Regex.Matches(strCurrFileText, strFindWidth, RegexOptions.IgnoreCase Or RegexOptions.Singleline)
    cnt1 = MyMatches.Count
    MyMatches = Regex.Matches(strCurrFileText, strFindHeight, RegexOptions.IgnoreCase Or RegexOptions.Singleline)
    cnt2 = MyMatches.Count
    MyMatches = Regex.Matches(strCurrFileText, strFindWidth2, RegexOptions.IgnoreCase Or RegexOptions.Singleline)
    cnt3 = MyMatches.Count
    MyMatches = Regex.Matches(strCurrFileText, strFindHeight2, RegexOptions.IgnoreCase Or RegexOptions.Singleline)
    cnt4 = MyMatches.Count
    cntTotal = cnt1 + cnt2 + cnt3 + cnt4

    If cntTotal > 0 Then
        ' Change the string using the regex codes
        strModified = Regex.Replace(strCurrFileText, strFindWidth, "${tokeep}", RegexOptions.Singleline Or RegexOptions.IgnoreCase)
        strModified = Regex.Replace(strModified, strFindHeight, "${tokeep}", RegexOptions.Singleline Or RegexOptions.IgnoreCase)
        strModified = Regex.Replace(strModified, strFindWidth2, "${tokeep}", RegexOptions.Singleline Or RegexOptions.IgnoreCase)
        strModified = Regex.Replace(strModified, strFindHeight2, "${tokeep}", RegexOptions.Singleline Or RegexOptions.IgnoreCase)

        'This writes back into file overwriting what's there.
        My.Computer.FileSystem.WriteAllText(strCurrFile, strModified, False)
        My.Computer.FileSystem.WriteAllText(My.Computer.FileSystem.CurrentDirectory & "\" & "resizelog.txt", vbCrLf & "File Processed: " & strCurrFile, True)
        intAffectedPages = intAffectedPages + 1
    End If
End Sub

I'm running into one additional small problem, however. After doing the regex replacements and writing the text back into the file, this string of strange characters appears at the very beginning of each .htm file that was rewritten:

ï»¿

It's not a huge problem, but it looks unprofessional in the final product.

The first few lines of code in an .htm file where this is happening look like this:
Code:
<!doctype HTML public "-//W3C//DTD HTML 4.0 Frameset//EN">

<html>

<head>
Does anyone have any ideas on why these characters appear and how I can get rid of them? Am I doing something wrong when writing back to the file?
  #4  
Old 08-31-2007, 03:28 PM
AtmaWeapon AtmaWeapon is online now
Ultimate Contributor

Forum Leader
* Guru *
 
Join Date: Feb 2004
Location: Austin, TX
Posts: 7,598
Default

If my hex editor isn't lying, those are byte order marks and text editors are supposed to throw such characters away.

This is the perfect example of what happens when power vs. ease of use decisions must be made. First, some background reading:
http://en.wikipedia.org/wiki/Byte_Order_Mark

The three characters you show have hex values EF BB BF, which is the byte order mark that indicates a file is encoded in UTF-8.

I call this a power vs. ease-of-use trade-off because you opted to use the simplest overload, which is implemented as follows:
Code:
Public Shared Sub WriteAllText(ByVal file As String, ByVal [text] As String, ByVal append As Boolean)
    FileSystem.WriteAllText(file, [text], append, Encoding.UTF8)
End Sub
The encoding represented by Encoding.UTF8 prepends its BOM by default:
Code:
Public Shared ReadOnly Property UTF8 As Encoding
    Get
        If (Encoding.utf8Encoding Is Nothing) Then
            Encoding.utf8Encoding = New UTF8Encoding(True)
        End If
        Return Encoding.utf8Encoding
    End Get
End Property
This is fine in most cases and any text editor that supports UTF-8 should be able to handle this with no problems. In fact I'm curious what editor you are using that displays the characters as I just tested it in Visual Studio, gVim, and Notepad and all handled it appropriately. If you wish to fix it, you have to sacrifice some ease-of-use for clarity and declare an Encoding that you wish to use:
Code:
        Dim textEncoding As System.Text.Encoding = New System.Text.UTF8Encoding(False)

        Dim blah As String = "test test"

        My.Computer.FileSystem.WriteAllText("test.txt", blah, False, textEncoding)
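To see the difference between the two encodings without opening a hex editor, a small sketch like the following can print the preamble (BOM) each one would write. The module and function names (BomCheck, PreambleHex) are illustrative assumptions; Encoding.GetPreamble and the UTF8Encoding(Boolean) constructor are the standard framework members.

```vbnet
Imports System
Imports System.Text

Module BomCheck
    ' Returns the bytes an encoding writes at the start of a file as hex pairs,
    ' or an empty string when the encoding writes no preamble at all.
    Public Function PreambleHex(ByVal enc As Encoding) As String
        Return BitConverter.ToString(enc.GetPreamble()).Replace("-", " ")
    End Function
End Module
```

PreambleHex(Encoding.UTF8) returns "EF BB BF", while PreambleHex(New UTF8Encoding(False)) returns an empty string, which is why passing the second encoding to WriteAllText keeps the three bytes out of the file.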
__________________
.NET Resources
My FAQ threads | Tutor's Corner | Code Library
I would bet money 2/3 of .NET questions are already answered in one of these three places.
  #5  
Old 08-31-2007, 04:02 PM
JaredHess JaredHess is offline
Regular
 
Join Date: Jul 2003
Posts: 81
Default

Thanks AtmaWeapon,

I figured after looking at this some more that it had something to do with character encoding or something but wasn't sure what. This entire issue is very much a mystery to me.

I've checked and you are correct, inside NotePad and NotePad++ these chars do not appear. I didn't think to look in those applications.

I was viewing the finalized .htm files inside the FireFox browser which does display these and when I selected View | Code from within Firefox they also appeared. Internet Explorer on the other hand does not display these.

I assumed they would appear in every text editor. This would explain why when I tried to do a find and replace inside NotePad++ it didn't find them at all...

I'll try to implement what you suggested. Thanks for the tips.
  #6  
Old 08-31-2007, 04:16 PM
AtmaWeapon AtmaWeapon is online now
Ultimate Contributor

Forum Leader
* Guru *
 
Join Date: Feb 2004
Location: Austin, TX
Posts: 7,598
Default

I think the issue there is because Firefox seems to use ISO encoding by default, which is kind of odd given the project's air towards compatibility. If you look in View>Encoding you can change how Firefox interprets the page and that might make it go away, but of course this is non-default.
  #7  
Old 05-27-2008, 04:11 PM
JaredHess JaredHess is offline
Regular
 
Join Date: Jul 2003
Posts: 81
Default

I implemented AtmaWeapon's suggestion and have been merrily using my little program for a few months now, but have run into a problem when running it to handle a directory of Chinese translated .htm files on my English computer (Vista OS). I think the character encoding in my program is messing up the output when I rewrite the file.

When I rewrite the file, how would I make the character encoding use the same character set as defined in the .htm file?

Here's a pic of what a typical file looks like before the file was rewritten by my program (these are the correct Chinese chars):
http://www.wilcoxassoc.com/Testing/i...inese_good.gif

Here's a pic of what the file looks like after running the program. Notice how the characters seem to be repeating themselves for some reason. (These chars are incorrect):
http://www.wilcoxassoc.com/Testing/i...hinese_bad.gif

The underlying characters in the HTM source are not the same after I run the program. I'm pretty sure this has to do with this encoding issue.

(also note that it isn't just Chinese that's having problems... in Spanish it's doing something related--albeit considerably less heinous--where it's throwing in some miscellaneous chars for alphabet chars containing accent marks, tildes, etc)

Many thanks in advance for any suggestions you can offer!
  #8  
Old 05-27-2008, 08:01 PM
AtmaWeapon AtmaWeapon is online now
Ultimate Contributor

Forum Leader
* Guru *
 
Join Date: Feb 2004
Location: Austin, TX
Posts: 7,598
Default

The pictures aren't really worth much to me; to get to the unicode values I'd have to open up a table of all the asian characters and compare; it'd take me probably an hour to look at three or four characters.

If you could post just the first few lines of a "before" and "after" text files, I bet I could figure out what's going on. My first guess is these Chinese files are not stored in UTF8, so the encoding process is messing them up. I have no idea how I think I'm going to figure out what the encoding is, but the pictures don't do a lot of good for me because I'm not familiar with Chinese characters.
  #9  
Old 05-28-2008, 12:36 PM
JaredHess JaredHess is offline
Regular
 
Join Date: Jul 2003
Posts: 81
Default

Thanks for looking at it AtmaWeapon. Here are a few lines from each .htm file once I opened them in NotePad++.

Before
Code:
<!doctype HTML public "-//W3C//DTD HTML 4.0 Frameset//EN">

<html>

<head>
<title>&#181;&#188;&#200;&#235;IGES&#206;&#196;&#188;&#254;</title>
<meta http-equiv="Content-Type" content="text/html; charset=GB2312">
<meta name="generator" content="RoboHelp by eHelp Corporation www.ehelp.com">
<link rel="stylesheet" href="../Pcdmis40.css"><style>
<!--
A:visited { color:#800080; }
A:link { color:#0000ff; }
-->
</style><script type="text/javascript" src="../roadmap.js" language="JavaScript1.2"></script>

<style title="hcp" type="text/css">
<!--
ol.hcp1 { list-style:decimal; }
a.hcp2 { x-condition:Online; }
span.hcp3 { layout-grid-mode:line; }
span.hcp4 { font-family:Arial; }
-->
</style>
</head>
<body lang="EN-US"><script type="text/javascript" language="JavaScript1.2" x-save-method="compute-relative" src="../ehlpdhtm.js"></script>
<script type="text/javascript"
		language=JavaScript1.2>
<!-- 
if( typeof( kadovFilePopupInit ) != 'function' ) kadovFilePopupInit = new Function();if( typeof( kadovTextPopupInit ) != 'function' ) kadovTextPopupInit = new Function();
 //-->
</script>

<div class=x-popup-text id=POPUP200707683  style='display: none; position: absolute' >
<p>&#193;&#227;&#188;&#254;&#179;&#204;&#208;&#242;&#202;&#199;&#182;&#212;&#178;&#226;&#193;&#191;&#186;&#205;&#188;&#236;&#178;&#226;&#181;&#196;&#206;&#196;&#215;&#214;&#195;&#232;&#202;&#246;&#161;&#163; &#195;&#191;&#184;&#246;&#193;&#227;&#188;&#254;&#179;&#204;&#208;&#242;&#182;&#188;&#211;&#208;&#206;&#168;&#210;&#187;&#181;&#196;&#195;&#251;&#179;&#198;&#163;&#172;&#192;&#169;&#213;&#185;&#195;&#251;&#206;&#170; .prg&#163;&#172; &#193;&#227;&#188;&#254;&#179;&#204;&#208;&#242;&#211;&#201;&#200;&#253;&#215;&#248;&#177;&#234;&#178;&#226;&#193;&#191;&#187;&#250;&#178;&#217;&#215;&#247;&#213;&#223;&#180;&#180;&#189;&#168;&#161;&#163; &#200;&#231;&#185;&#251;&#211;&#235; CAD &#196;&#163;&#208;&#205;&#207;&#224;&#185;&#216;&#193;&#170;&#163;&#172;CAD &#206;&#196;&#188;&#254;&#189;&#171;&#211;&#235;&#210;&#212; .CAD &#206;&#170;&#192;&#169;&#213;&#185;&#195;&#251;&#181;&#196;&#193;&#227;&#188;&#254;&#179;&#204;&#208;&#242;&#206;&#196;&#188;&#254;&#205;&#172;&#195;&#251;&#161;&#163; &#200;&#231;&#185;&#251;&#211;&#235; CAD &#196;&#163;&#208;&#205;&#207;&#224;&#185;&#216;&#193;&#170;&#163;&#172;CAD &#206;&#196;&#188;&#254;&#189;&#171;&#211;&#235;&#210;&#212; .CAD &#206;&#170;&#192;&#169;&#213;&#185;&#195;&#251;&#181;&#196;&#193;&#227;&#188;&#254;&#179;&#204;&#208;&#242;&#206;&#196;&#188;&#254;&#205;&#172;&#195;&#251;&#161;&#163;</p>
</div>

After
Code:
<!doctype HTML public "-//W3C//DTD HTML 4.0 Frameset//EN">

<html>

<head>
<title>&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;IGES&#239;&#191;&#189;&#196;&#188;&#239;&#191;&#189;</title>
<meta http-equiv="Content-Type" content="text/html; charset=GB2312">
<meta name="generator" content="RoboHelp by eHelp Corporation www.ehelp.com">
<link rel="stylesheet" href="../Pcdmis40.css"><style>
<!--
A:visited { color:#800080; }
A:link { color:#0000ff; }
-->
</style><script type="text/javascript" src="../roadmap.js" language="JavaScript1.2"></script>

<style title="hcp" type="text/css">
<!--
ol.hcp1 { list-style:decimal; }
a.hcp2 { x-condition:Online; }
span.hcp3 { layout-grid-mode:line; }
span.hcp4 { font-family:Arial; }
-->
</style>
</head>
<body lang="EN-US"><script type="text/javascript" language="JavaScript1.2" x-save-method="compute-relative" src="../ehlpdhtm.js"></script>
<script type="text/javascript"
		language=JavaScript1.2>
<!-- 
if( typeof( kadovFilePopupInit ) != 'function' ) kadovFilePopupInit = new Function();if( typeof( kadovTextPopupInit ) != 'function' ) kadovTextPopupInit = new Function();
 //-->
</script>

<div class=x-popup-text id=POPUP200707683  style='display: none; position: absolute' >
<p>&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#199;&#182;&#212;&#178;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#205;&#188;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189; &#195;&#191;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#206;&#168;&#210;&#187;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#198;&#163;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#213;&#185;&#239;&#191;&#189;&#239;&#191;&#189;&#206;&#170; .prg&#239;&#191;&#189;&#239;&#191;&#189; &#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#223;&#180;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189; &#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189; CAD &#196;&#163;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;CAD &#239;&#191;&#189;&#196;&#188;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189; .CAD 
&#206;&#170;&#239;&#191;&#189;&#239;&#191;&#189;&#213;&#185;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#196;&#188;&#239;&#191;&#189;&#205;&#172;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189; &#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189; CAD &#196;&#163;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;CAD &#239;&#191;&#189;&#196;&#188;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189; .CAD &#206;&#170;&#239;&#191;&#189;&#239;&#191;&#189;&#213;&#185;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#196;&#188;&#239;&#191;&#189;&#205;&#172;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;&#239;&#191;&#189;</p>
</div>
If you need it, I've also posted a zip containing these full .htm files (saved as .txt files). You can get them from here:
http://www.wilcoxassoc.com/Testing/f..._txt_files.zip

Again, thanks for the help.
  #10  
Old 05-28-2008, 04:31 PM
AtmaWeapon AtmaWeapon is online now
Ultimate Contributor

Forum Leader
* Guru *
 
Join Date: Feb 2004
Location: Austin, TX
Posts: 7,598
Default

I'm about to leave for home, but here's some things I noticed while trying to figure out the encoding at play here. I'm mainly posting so it's like I left a notebook open for me to continue later; there's some experimental programs I want to write and this is an efficient way to propagate the links

The first apparent Chinese character is in the <title> element; I opened the files you gave in a hex editor to see how the character might be represented. The first few bytes in the good file are B5 BC C8 EB; the first few in the bad file are EF BF BD EF. Obviously there's a change, but we knew this to begin with.

My suspicion that the first file isn't actually UTF-8 seemed to be confirmed when I checked a list of unicode characters; codepoint U+B5BC didn't look anything like the picture of the correct character you gave me though. So, on a whim, I did a Google search for "0xB5BC", the C-style representation of this hex format. The result was the documentation for some UTF-8 Perl module. The interesting part here is the hashmap assignment:
Code:
0xb5bc=>0x5BFC
Again, on a whim, I checked the Unicode character list for U+5BFC; as far as I can tell this character is identical to the first character in the page's title.

Where I'm stuck is why this translation was made, because there doesn't seem to be a clear pattern. The next character in the title is 0xC8EB; this is translated to 0x5165 in the Perl module; again this character matches according to my untrained eye. I don't understand how this translation table was generated; for all I know it's just a table that has to be known.

This means there's probably some encoding class that can do the conversion, but I'm not sure which. My guess is the original file is UTF-16, but that seems odd because the latin characters should be represented with two bytes as well (for example, the < character would be 0x003C instead of 0x3C). This is where I want to fool around with some test programs.
  #11  
Old 05-29-2008, 12:35 PM
AtmaWeapon AtmaWeapon is online now
Ultimate Contributor

Forum Leader
* Guru *
 
Join Date: Feb 2004
Location: Austin, TX
Posts: 7,598
Default

Got it. To understand the problem we have to take a tour through encoding history. This is all off the top of my head, and in some cases I'm guessing as to why decisions were made, but the technical details should be accurate.

ASCII
In the beginning, most computers were operated in America and Europe. Americans designed the ASCII character encoding (the "A" is for "American", true story!). In this character encoding, 7 bits are used for character data. This provides ample room for the 10 digits, 52 letters, and several punctuation marks that make up common American English communications; all said there are 94 printable characters in ASCII.

ANSI
Of course, several European languages use more than the characters used by English. French scientists couldn't just use a different encoding though; if they did this then their text files would not be compatible with the text files of people using ASCII. The solution people worked out lies in the 8th bit that ASCII doesn't use. When all 8 bits are used, you get 256 total characters to work with. The decision was made that no one would touch the characters in 7-bit ASCII, but anything above 0x7F would be defined in code pages. So, if I'm using the code page for American English, I'll have a lot of funky symbols in the 0x80 to 0xFF range, but if I'm using the code page for Russian (Cyrillic?) I'll have the characters of the Russian alphabet available in this range.

A consequence of this encoding is that you have to use the same code page to view a file as the page that was used to create it; if you don't, it will look like gibberish. Another problem is some languages have far more than 256 characters, and ASCII/ANSI only use 8 bits for character data. The solution was multiple-byte character sets; for certain code pages certain codes were set aside as "marker" characters that meant the particular character required 2 or more bytes to be defined. For example, some Chinese character might be represented as the bytes 0x81 0x13, where 0x81 is some special character that says "This character is 2 bytes long, read the next one too!". This is your problem, but let's come back to it as we have to go a step farther to see what's going wrong.

Unicode and its flavors
Unicode addresses the problems with ASCII and ANSI by creating a chart of characters independent of the encodings that represent them. According to Wikipedia's Unicode article, there are 1,114,112 valid code points in the range 0x0 to 0x10FFFF. Unicode leaves the question of how to represent its characters to specific encodings; here are some common ones:
  • UTF-8 is a variable-width encoding; this means that for characters outside the range that can be expressed by one byte, two or more bytes are used. You can read about how this is done, but what it basically means is you have to perform a calculation to go from a Unicode character outside of the standard ASCII range to the bytes that represent the character in UTF-8. UTF-8 is good for English text because it only uses one byte per character for characters in the ASCII range.
  • UTF-16 is a variable-width encoding that tends to be what most people associate with Unicode; it's the native text format in Windows since Windows 2000 and both Java and .NET use it internally. Every character takes at least 2 bytes to represent; characters that cannot be expressed in two bytes are expressed using surrogate pairs; hit up a reference for more information. Some people favor UTF-8 over UTF-16 because English text in UTF-16 will take twice as many bytes to represent.
  • UTF-32 is the lazy man's Unicode encoding; it always uses 4 bytes per character and thus can represent the entire Unicode character set as it exists today. In general, this encoding is very wasteful unless your text file uses characters that are in the upper ranges of Unicode code points.
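The size differences between the three encodings above can be checked with a short sketch. The names (EncodingSizes, ByteCounts) are illustrative assumptions; Encoding.GetByteCount is the standard framework method and does not count a BOM.

```vbnet
Imports System
Imports System.Text

Module EncodingSizes
    ' Reports how many bytes each Unicode encoding needs for the given text.
    Public Function ByteCounts(ByVal value As String) As String
        Return String.Format("UTF-8: {0}, UTF-16: {1}, UTF-32: {2}", _
            Encoding.UTF8.GetByteCount(value), _
            Encoding.Unicode.GetByteCount(value), _
            Encoding.UTF32.GetByteCount(value))
    End Function
End Module
```

ByteCounts("A") gives "UTF-8: 1, UTF-16: 2, UTF-32: 4", while a Chinese character such as ChrW(&H5BFC) gives "UTF-8: 3, UTF-16: 2, UTF-32: 4". So UTF-8 is the smallest for ASCII-range text but not for every script.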

Now that we have the background information out of the way, I can teach you how I used this information to guess which encoding your text files had and how to properly work with them.

I pointed out that I had already been able to convert characters from their representation in the file to their Unicode equivalents via the Perl module I happened to find. It was obvious the encoding at work was not UTF-8, since UTF-8 wasn't working and the bytes weren't correct for UTF-8 characters. It also was obviously not UTF-16 or UTF-32, because 1) there would be null bytes for the ASCII characters in the HTML tags and 2) the code point would be used without translation in either of these encodings.
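A small sketch can reproduce what happens when bytes that are not UTF-8 are read and rewritten as UTF-8. The names (WrongDecoderDemo, RoundTripAsUtf8) are illustrative assumptions. A UTF-8 decoder substitutes the replacement character U+FFFD for each invalid sequence, and U+FFFD is written back as the bytes EF BF BD.

```vbnet
Imports System
Imports System.Text

Module WrongDecoderDemo
    ' Decodes the bytes as UTF-8, re-encodes the resulting string as UTF-8,
    ' and returns the new bytes as hex pairs. Input that is not valid UTF-8
    ' comes back as U+FFFD, whose UTF-8 form is EF BF BD.
    Public Function RoundTripAsUtf8(ByVal original As Byte()) As String
        Dim utf8NoBom As New UTF8Encoding(False)
        Dim decoded As String = utf8NoBom.GetString(original)
        Return BitConverter.ToString(utf8NoBom.GetBytes(decoded)).Replace("-", " ")
    End Function
End Module
```

Passing the first four bytes of the good file, New Byte() {&HB5, &HBC, &HC8, &HEB}, yields output that begins with EF BF BD, the same bytes reported for the bad file. Plain ASCII bytes pass through unchanged, which is why the HTML tags survived the rewrite.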

This left me with ASCII/ANSI as the possible format; it seemed to fit. ASCII characters were using a single byte and had the encodings I expected, Chinese characters were using non-Unicode bytes and seemed to be using multiple bytes. The best way to find out is to test. To use a particular code page, you have to jump through a few hoops. The target is an instance of the System.Text.Encoding class that we can pass to our I/O classes to tell them what encoding to use. First, you need a System.Globalization.CultureInfo object with the correct culture; I wish I could find a list of the cultures but all I have found is comments in examples in MSDN documentation. So, if I pass "zh-CHS" to the CultureInfo.GetCultureInfo method, I get a CultureInfo instance for the Chinese (Simplified) culture; this has a TextInfo property that has an ANSICodePage property that defines the code page I need to use. This code page can be passed to the System.Text.Encoding.GetEncoding method to get the Encoding we need. Target reached!
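As a cross-check, the sketch below decodes bytes with a code-page encoding requested by name and lists the resulting code points. It assumes the charset name from the file's own meta tag (GB2312) and a .NET Core 3.0+ / .NET 5+ runtime, where code-page encodings have to be registered first; the names Gb2312Demo and CodePointsOf are illustrative.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module Gb2312Demo
    ' Decodes bytes with the named encoding and lists the resulting
    ' UTF-16 code units as "U+XXXX" values separated by spaces.
    Public Function CodePointsOf(ByVal bytes As Byte(), ByVal charsetName As String) As String
        ' Assumption: .NET Core 3.0+ / .NET 5+. On .NET Framework the code-page
        ' encodings are built in and this line can be removed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)

        Dim decoded As String = Encoding.GetEncoding(charsetName).GetString(bytes)
        Dim parts As New List(Of String)
        For Each c As Char In decoded
            parts.Add("U+" & AscW(c).ToString("X4"))
        Next
        Return String.Join(" ", parts.ToArray())
    End Function
End Module
```

With the first four bytes of the good file, CodePointsOf(New Byte() {&HB5, &HBC, &HC8, &HEB}, "gb2312") should return "U+5BFC U+5165". These are the two values found in the Perl module's table, which suggests that table is simply the GB2312-to-Unicode mapping.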

Below is the program I wrote to test each encoding; I renamed the good file to "good.txt". The program uses each encoding to open the good file, then write a new file using that encoding and what it read. After it ran I used a diff program (I used Compare It!, but a free alternative is WinMerge) to compare the original file to the output file for each encoding; the file generated using the ANSI code page encoding was identical.

Code:
Imports System.IO
Imports System.Text
Imports System.Globalization

Module Module1



    Sub Main()
        TryUtf8()
        TryUtf32()
        TryAscii()
        TryUnicode()
        TryWithCulture()
    End Sub

    Private Sub TryUtf8()
        Dim data As String

        Using reader As New StreamReader("good.txt", Encoding.UTF8)
            data = reader.ReadToEnd()
        End Using

        Using writer As New StreamWriter("utf8.txt", False, Encoding.UTF8)
            writer.Write(data)
        End Using
    End Sub

    Private Sub TryUtf32()
        Dim data As String

        Using reader As New StreamReader("good.txt", Encoding.UTF32)
            data = reader.ReadToEnd()
        End Using

        Using writer As New StreamWriter("utf32.txt", False, Encoding.UTF32)
            writer.Write(data)
        End Using
    End Sub

    Private Sub TryAscii()
        Dim data As String

        Using reader As New StreamReader("good.txt", Encoding.ASCII)
            data = reader.ReadToEnd()
        End Using

        Using writer As New StreamWriter("ascii.txt", False, Encoding.ASCII)
            writer.Write(data)
        End Using
    End Sub
 
    Private Sub TryUnicode()
        Dim data As String

        Using reader As New StreamReader("good.txt", Encoding.Unicode)
            data = reader.ReadToEnd()
        End Using

        Using writer As New StreamWriter("unicode.txt", False, Encoding.Unicode)
            writer.Write(data)
        End Using
    End Sub

    Private Sub TryWithCulture()
        Dim data As String
        Dim cInfo As CultureInfo
        cInfo = CultureInfo.GetCultureInfo("zh-CHS") ' Chinese (Simplified)

        Dim codePage As Integer = cInfo.TextInfo.ANSICodePage
        Dim encoding As Encoding = Text.Encoding.GetEncoding(codePage)

        Using reader As New StreamReader("good.txt", encoding)
            data = reader.ReadToEnd()
        End Using

        Using writer As New StreamWriter("ansi.txt", False, encoding)
            writer.Write(data)
        End Using
    End Sub

End Module
You might want to nudge the people generating the files into seeing if their editor lets them save files as UTF-8. Seeing as they are generating Chinese-language files that have to be modified by machines in other countries it makes the most sense to use some kind of Unicode encoding. I really wish more text editors would default to this; ASCII/ANSI is so 1998 (I have Windows programming books from this era advocating the use of Unicode over ANSI encoding!) If they do not save their files in UTF-8, then you'll be forced to have this special workaround for their files.
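For a folder that mixes languages, one possible workaround is to read the charset each file declares in its own meta tag and use that encoding for both reading and writing. The sketch below is a rough illustration under stated assumptions: the names (CharsetPreservingIO, DeclaredCharset, EncodingFor, ReadHtml) are hypothetical, the declaration is assumed to appear in the first 4096 bytes, and on .NET Core / .NET 5+ code-page encodings such as GB2312 must be registered with Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) before GetEncoding will find them.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module CharsetPreservingIO
    ' Returns the charset named in the page's markup (e.g. "GB2312"),
    ' or Nothing when no charset declaration is found.
    Public Function DeclaredCharset(ByVal rawBytes As Byte()) As String
        ' ASCII decoding is enough here: tag and charset names are plain ASCII.
        Dim head As String = Encoding.ASCII.GetString(rawBytes, 0, Math.Min(rawBytes.Length, 4096))
        Dim m As Match = Regex.Match(head, "charset\s*=\s*[""']?([\w-]+)", RegexOptions.IgnoreCase)
        If m.Success Then Return m.Groups(1).Value
        Return Nothing
    End Function

    ' Picks the encoding to use: the declared one, or fallback if none is declared.
    ' An unknown charset name raises ArgumentException rather than guessing,
    ' because rewriting with the wrong encoding would damage the file.
    Public Function EncodingFor(ByVal rawBytes As Byte(), ByVal fallback As Encoding) As Encoding
        Dim name As String = DeclaredCharset(rawBytes)
        If name Is Nothing Then Return fallback
        Return Encoding.GetEncoding(name)
    End Function

    ' Reads a file as text and reports which encoding was used, so the caller
    ' can write the modified text back with the same encoding.
    Public Function ReadHtml(ByVal filePath As String, ByRef usedEncoding As Encoding) As String
        Dim raw As Byte() = File.ReadAllBytes(filePath)
        usedEncoding = EncodingFor(raw, New UTF8Encoding(False))
        Return usedEncoding.GetString(raw)
    End Function
End Module
```

After modifying the string returned by ReadHtml, writing it back with File.WriteAllText(filePath, modified, usedEncoding) keeps the bytes in the charset the page declares. This only helps when the declaration is accurate.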
  #12  
Old 06-16-2008, 01:25 PM
JaredHess JaredHess is offline
Regular
 
Join Date: Jul 2003
Posts: 81
Default

AtmaWeapon,

Just wanted to say thanks for your help on this. I've managed to get my routine working with the Chinese files.

I'm sure I speak for others as well when I say I really appreciate how you and others often take time out of what I'm sure are busy days to offer much-needed suggestions and help.

Again, thanks.

Jared