ANSI - UTF8 issue
#1
04-25-2008, 11:18 PM
martonx (Szeged, Hungary)


Hi everybody,

I made a little .txt to .csv converter program in VB2008, and I have a serious problem with character encoding. In Hungary we use a lot of language-specific letters.
My source .txt is in ANSI. When I read the stream, I get wrong characters or wrong strings; for example, a little rectangle instead of the character 'á'.
I also tried reading the stream into a plain Integer array and converting each value to a Char with ChrW, and I got the same wrong result...
How can I convert ANSI to UTF-8 correctly? Or how can I have the text read as ANSI instead of the default UTF-8?

Please help me! This is very important, and I can't change the .txt source, because we get it from another country.

I attached an example .txt file with a lot of Hungarian-specific ANSI characters.
Attached file: pelda.txt (23 bytes)

#2
04-26-2008, 01:31 AM
Csharp (Ashby, Leicestershire)

Try using Default as the encoding, e.g.:
Code:
        ' Requires Imports System.IO at the top of the file
        Using sr As New StreamReader("C:\pelda.txt", System.Text.Encoding.Default)
            Console.WriteLine(sr.ReadToEnd())
        End Using
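
To see which code page Default actually stands for on a given machine, something like the following can help. This is only a small sketch; the module name is made up for illustration. On the .NET Framework, Encoding.Default is the system's ANSI code page, whereas on .NET Core and later it is UTF-8, so the result depends on both the runtime and the Windows settings.

```vbnet
Imports System
Imports System.Text

Module ShowDefaultEncoding   ' hypothetical name, for illustration only
    Sub Main()
        ' Encoding.Default is whatever the runtime treats as the system default.
        Dim ansi As Encoding = Encoding.Default

        ' Print the human-readable name and the numeric code page.
        Console.WriteLine("{0} (code page {1})", ansi.EncodingName, ansi.CodePage)
    End Sub
End Module
```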

#3
04-26-2008, 09:42 AM
AtmaWeapon (Austin, TX)

You're not going to get consistent results converting from ANSI to UTF-8 when the text uses any characters outside the "standard" ASCII set (0-127). "ANSI" is not a single encoding: what a byte from 128 to 255 means depends on the code page in use, while UTF-8 encodes Unicode code points and needs more than one byte for everything above 127. If the file's bytes are decoded with the wrong code page, or as if they were already UTF-8, you're going to have oddball characters.
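
A tiny sketch of the point above: the same byte decoded under two different code pages, then written out as UTF-8. The module name is hypothetical, and it assumes code pages 1250 (Central European) and 1252 (Western European) are available, which is the case on the .NET Framework on Windows; on .NET Core and later they have to be registered first through the System.Text.Encoding.CodePages provider.

```vbnet
Imports System
Imports System.Text

Module CodePageDemo   ' hypothetical name, for illustration only
    Sub Main()
        ' One byte above 127; its meaning depends on the code page.
        Dim raw As Byte() = New Byte() {&HF5}

        ' Assumption: code pages 1250 and 1252 are available on this runtime.
        Dim central As String = Encoding.GetEncoding(1250).GetString(raw)
        Dim western As String = Encoding.GetEncoding(1252).GetString(raw)

        Console.WriteLine("1250: U+{0:X4}", AscW(central(0)))   ' 1250: U+0151 (ő)
        Console.WriteLine("1252: U+{0:X4}", AscW(western(0)))   ' 1252: U+00F5 (õ)

        ' The Hungarian letter takes two bytes once it is encoded as UTF-8.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(central)))   ' C5-91
    End Sub
End Module
```

So a converter has to know which code page the file was written in before it can produce the right UTF-8 bytes.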

You can try to use specific encodings:
Code:
Imports System.IO
Imports System.Text

Module Module1

    Sub Main()
        Dim fileContents As String

        ' Note: Encoding.ASCII only covers 0-127; any other byte is read as "?"
        Using reader As New StreamReader("pelda.txt", Encoding.ASCII)
            fileContents = reader.ReadToEnd()
        End Using

        ' now let's try writing
        Using writer As New StreamWriter("pelda_new.txt", False, Encoding.UTF8)
            writer.Write(fileContents)
        End Using
    End Sub

End Module
But it may or may not work, depending on your system code page. My machine uses whatever Latin code page corresponds to en-US, and all I get is question marks when I read your file. However, you might want to see what happens if you treat the file as UTF-8 by default; Notepad displays the original file with the proper characters, so I'm thinking that if you ignore the fact that it's supposed to be ANSI to begin with, things might work.

You might be able to play around with the encoding like CSharp said, but the easiest way is to use some form of Unicode encoding for your text files; the world agreed upon this standard something like a decade ago.

#4
04-26-2008, 10:14 AM
AtmaWeapon (Austin, TX)

I knew there had to be a way to deal with the code page, so I managed to dig it up:
Code:
Imports System.IO
Imports System.Globalization
Imports System.Text

Module Module1

    Sub Main()
        ' First, we need to get the code page for the Swedish culture
        Dim culture As CultureInfo = New CultureInfo("sv-SE")
        Dim codePage As Integer = culture.TextInfo.ANSICodePage

        ' Now we need to make an encoding based on this Swedish code page
        Dim swedenEncoding As Encoding = Encoding.GetEncoding(codePage)

        ' Now, we'll open the first file and save it as UTF-8 to demonstrate it works
        Dim fileContents As String

        Using reader As New StreamReader("pelda.txt", swedenEncoding)
            fileContents = reader.ReadToEnd()
        End Using

        Using writer As New StreamWriter("pelda_new.txt", False, Encoding.UTF8)
            writer.Write(fileContents)
        End Using
    End Sub

End Module
This even works on my machine using English code pages, so I'm certain it will work. My first post was more gloom and doom because I didn't see an easy way to set the code page of an encoding.
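
The example above uses the Swedish culture, while the file in question is described as Hungarian. A variant for that case might look like the sketch below. It assumes the file really was written with the Hungarian ANSI code page (Windows-1250), which the thread does not confirm; the module name is made up, and the file names are the ones used earlier.

```vbnet
Imports System.Globalization
Imports System.IO
Imports System.Text

Module HungarianConvert   ' hypothetical name, for illustration only
    Sub Main()
        ' Assumption: the source file uses the Hungarian ANSI code page (1250).
        Dim codePage As Integer = New CultureInfo("hu-HU").TextInfo.ANSICodePage
        Dim sourceEncoding As Encoding = Encoding.GetEncoding(codePage)

        ' Decode with the source code page, then write the text back out as UTF-8.
        Dim fileContents As String = File.ReadAllText("pelda.txt", sourceEncoding)
        File.WriteAllText("pelda_new.txt", fileContents, Encoding.UTF8)
    End Sub
End Module
```

If letters such as ő or ű come out wrong with one culture but correct with the other, that is a good hint as to which code page the file was actually saved in.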

#5
04-27-2008, 10:56 PM
martonx (Szeged, Hungary)

Thank you very much!

I get the original .txt file from PSA France, so I have no say in its format.

Csharp's solution works (AtmaWeapon's too), and I learned a lot from AtmaWeapon!

I hope this thread will be useful for others.

#6
04-28-2008, 01:19 PM
AtmaWeapon (Austin, TX)

There's a very subtle difference between CSharp's code and mine. Neither is "better", but you have to understand this difference.

CSharp's code uses the code page configured as the default on the current system. If your current system is set to use the Swedish code page, you're set. If your current system is set to use the English code page, you might have troubles. The only problem with CSharp's code is that it only works when Windows is configured for the same code page as the file (Swedish, in my example).

My code manually declares that it will use the code page for Swedish. Even if your PC is set to use Japanese, it will do the right thing when it converts a Swedish ANSI file to UTF-8. It is inflexible, though. If you want the culture to change based on the current PC, you have to recompile.

Neither is better because both are incorrect in certain scenarios.
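
One possible middle ground, sketched under assumptions: take the culture name as a command-line argument and fall back to the system default when none is given, so switching cultures does not need a recompile. ResolveEncoding and the module name are hypothetical helpers, not something from either post.

```vbnet
Imports System.Globalization
Imports System.IO
Imports System.Text

Module AnsiToUtf8   ' hypothetical name, for illustration only
    ' Hypothetical helper: returns the ANSI encoding for the given culture name,
    ' or the system default encoding when no name is supplied.
    Function ResolveEncoding(ByVal cultureName As String) As Encoding
        If String.IsNullOrEmpty(cultureName) Then
            Return Encoding.Default
        End If

        Dim codePage As Integer = New CultureInfo(cultureName).TextInfo.ANSICodePage
        Return Encoding.GetEncoding(codePage)
    End Function

    Sub Main(ByVal args As String())
        ' The first argument, if present, is a culture name such as "hu-HU" or "sv-SE".
        Dim cultureName As String = Nothing
        If args.Length > 0 Then cultureName = args(0)

        Dim sourceEncoding As Encoding = ResolveEncoding(cultureName)
        Dim contents As String = File.ReadAllText("pelda.txt", sourceEncoding)
        File.WriteAllText("pelda_new.txt", contents, Encoding.UTF8)
    End Sub
End Module
```

This still cannot detect the file's code page by itself; someone has to know where the file came from and pass the matching culture.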