I am trying to find a suitable way to convert any word from any language to an I2of5 (Interleaved 2 of 5) barcode.
In basic terms, I am thinking of constructing a numerical representation of a word with the AscW function, in a single pass through the variable that holds the word. I might use the first two digits for a language ID and the last two digits for the string length, to reduce the possibility of collisions.
Are there any other suggestions? Or can you guys see any fault (fatal or obvious ;) ) in this simple approach?
10-26-2008, 08:05 PM
Using a language ID is silly*. If your text encoding is anything but ASCII and you use the same encoding to both encode and decode, you don't have to care about the language.
Having the last two digits be a length might be a bad idea. How do you know when you hit the last two digits if you don't know the length? You might be thinking that as soon as you hit a number you know you are at the length, but what if you decide to store numbers as part of the string later on? I suggest that length fields be stored first, so you know where the end of the string is.
* UNLESS your barcode format only supports characters in the standard ASCII range 0-127. That seems kind of silly though.
Text encoding is Unicode.
The Flash Cards they want to support includes languages such as Chinese.
The thing is, I am developing “Flash Card Language Learning” software for a learning center that has a very limited budget for the project, so my part is simply a giveaway-to-the-community type of project. They will be spending their budget on hardware (barcode scanners, printer, computers, etc.).
This specific learning center has already invested in Language Learning Flash Cards. In other words, they have an inventory of these cards.
Now, what they wish to see (I gave them a simple, prioritized “User Story” document that they filled out) is to use this inventory while being able to add new Flash Cards in house (through the software). In addition, they wish to have a system that can support Flash Cards from any language to any other language (English to Chinese, Chinese to Spanish, English to English, etc.). They want an audio feature for pronunciation, and they want to see the definition/meaning of the word that is on the Flash Card. Moreover, they want to have image representations of these words on the Flash Cards.
There is no problem with these “User Stories”. All can be implemented.
Now back to my original post:
I am putting all the Flash Cards in one table and I need to identify them by barcode, which is where I2of5 comes into play. In this schema, I definitely need language IDs appended to the end of the barcode, since there are words that map to the same barcode yet belong to different languages and so have different meanings. To be clear, the Barcode field is unique.
Soba (Turkish) means heater
Soba (English) means a Japanese noodle made from something I don’t know.
The idea of putting length information, I don’t know what I was thinking, when I think about it now, I don’t see my point ; )
Anyway, to construct unique I2of5 codes for the words, as I said before, I scan through the word, get the AscW value of each character, and build the barcode by string concatenation. I end up with very long unique barcode strings (for a specific language), which is not good (because I am going to print these barcodes on Avery 5160 label sheets). So I pass these unique barcode strings to .NET's String.GetHashCode method, which reduces them to an acceptable length. I add a sign digit (0 for negative, 1 for positive) to that string and append the language ID.
I ran a unit test with 100,000 words and didn't see any collisions. If I use only GetHashCode on the word, without first building the AscW string, I end up with 2 collisions in a set of 100,000 words.
But given how GetHashCode works, there will be collisions (it is a hash function after all), so I will implement a mechanism to handle them.
So, back to my question: now that you know the developer's story, can you guys think of a more reliable method (or pattern?) for mapping words to I2of5 codes?
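For readers following along, the scheme described above might look roughly like the sketch below. The names `ToCodePointString` and `ToBarcode`, the 10-digit zero padding, and the two-digit language ID are assumptions made for illustration; the post does not give the exact layout.

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module HashedBarcodeSketch
    ' Single pass over the word: concatenate the AscW value of every character.
    Function ToCodePointString(word As String) As String
        Dim sb As New StringBuilder()
        For Each c As Char In word
            sb.Append(AscW(c))
        Next
        Return sb.ToString()
    End Function

    ' Hypothetical layout: sign digit, 10-digit hash magnitude, 2-digit language ID.
    Function ToBarcode(word As String, languageId As Integer) As String
        Dim hash As Integer = ToCodePointString(word).GetHashCode()
        Dim signDigit As String = If(hash < 0, "0", "1")
        ' Widen to Long first so Integer.MinValue does not overflow in Math.Abs.
        Dim magnitude As Long = Math.Abs(CLng(hash))
        Return signDigit & magnitude.ToString("D10") & languageId.ToString("D2")
    End Function

    Sub Main()
        ' Same spelling, different (made-up) language IDs, so different barcodes.
        Console.WriteLine(ToBarcode("Soba", 1))
        Console.WriteLine(ToBarcode("Soba", 2))
    End Sub
End Module
```

One caveat worth keeping in mind: String.GetHashCode is not guaranteed to return the same value across .NET versions or platforms, so a barcode printed from it could stop matching after an upgrade. Storing the generated code in the database next to the word, rather than recomputing it, would sidestep that.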
Thanks for your time..
BTW, I2of5 is a numbers only barcode.
10-29-2008, 03:11 PM
Ouch, numbers only. That would definitely make storing the bytes of the word itself cumbersome.
"Unicode" itself is a character set rather than an encoding. The actual encodings are UTF-8, UTF-16, UTF-32, and possibly others. I'm going to suppose UTF-8 because it's a fairly common encoding (it looks identical to ASCII if there are no characters above position 127). It sounds like you're going to too much trouble to get the bytes out of the string. Instead of a character-by-character AscW method, use the classes in System.Text to do it all in one fell swoop. Let's say you make a word ID by appending "|" followed by the language. You can get the bytes for this word like so:
Dim wordId As String = "Dog|English"
Dim bytes() As Byte = System.Text.Encoding.UTF8.GetBytes(wordId)
Still, representing this in a digits-only barcode would consume a lot of space, as you probably noticed. Since a byte can be 0-255, you need 3 digits per byte. The example above translates to this:
068111103124069110103108105115104
All that for a 3-letter word! Things might get worse as other languages require multiple bytes per character. The number of "1" digits in that string suggests that some kind of compression algorithm might work, but then you'd have to guarantee uniqueness of the compressed value and I'm pretty sure that's out.
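The three-digits-per-byte mapping can be sketched as follows. `BytesToDigits` and `DigitsToBytes` are hypothetical helper names, not part of System.Text; the round trip shows that the mapping is reversible, just long.

```vbnet
Imports System
Imports System.Text

Module ByteDigitsSketch
    ' Render each byte as exactly three decimal digits (000-255).
    Function BytesToDigits(bytes As Byte()) As String
        Dim sb As New StringBuilder(bytes.Length * 3)
        For Each b As Byte In bytes
            sb.Append(b.ToString("D3"))
        Next
        Return sb.ToString()
    End Function

    ' Reverse the mapping: read three digits at a time back into bytes.
    Function DigitsToBytes(digits As String) As Byte()
        Dim result(digits.Length \ 3 - 1) As Byte
        For i As Integer = 0 To result.Length - 1
            result(i) = Byte.Parse(digits.Substring(i * 3, 3))
        Next
        Return result
    End Function

    Sub Main()
        Dim digits As String = BytesToDigits(Encoding.UTF8.GetBytes("Dog|English"))
        Console.WriteLine(digits) ' 068111103124069110103108105115104
        Console.WriteLine(Encoding.UTF8.GetString(DigitsToBytes(digits))) ' Dog|English
    End Sub
End Module
```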
You are definitely right that relying on the hash code won't be sufficient; it's trying to map an infinite space to a 32-bit integer, which means that collisions won't necessarily be rare.
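As a rough sanity check on how rare collisions are, the birthday approximation estimates the expected number of colliding pairs as n(n-1)/(2m) for n items hashed into m buckets. The sketch below assumes the hash spreads values uniformly over all 2^32 outputs, which a real GetHashCode implementation may not do; `ExpectedCollisions` is a name made up for this example.

```vbnet
Imports System

Module CollisionEstimateSketch
    ' Birthday approximation: expected colliding pairs is about n(n-1) / (2m).
    Function ExpectedCollisions(itemCount As Double, bucketCount As Double) As Double
        Return itemCount * (itemCount - 1) / (2 * bucketCount)
    End Function

    Sub Main()
        Dim buckets As Double = Math.Pow(2, 32)
        ' Roughly 1.16 expected colliding pairs for 100,000 words.
        Console.WriteLine(ExpectedCollisions(100000, buckets))
    End Sub
End Module
```

Under that assumption, seeing a couple of collisions in a set of 100,000 words is about what one would expect, and the count grows with the square of the number of words.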
I don't know how workable this solution may be, but I do have an idea. Storing the word is expensive, but it really sounds like what you need is a database that relates words and languages.
There's one point on which I'm not clear: do you need to know the foreign word or can it suffice to have the barcode point out that this is <language>'s version of a native word?
If it would be OK to use the same word and differentiate by language, you could get by with two simple database tables:
Languages:
Id | Language
0 | English
1 | Spanish

Words:
Id | Word
0 | dog
1 | house
Now, let's say we have two cards: "dog" and "perro". My limited cultural awareness tells me this is dog in English and Spanish. To generate the ID for each card, use the ID of the word and the ID of the language; preferably separate them with a character such as @ that won't be used in words.
So, for step 1, "dog" will be "0@0" and "perro" will be "0@1". Convert this to bytes and you'll have 048064048 and 048064049, respectively. Honestly, you could parse it more intelligently, encode the IDs as numbers rather than text, keep only the "064" as a separator, and write these as 00640 and 00641 instead. This will significantly reduce the length of the barcodes. Suppose there are 50,000 words and 101 languages supported; using strings, your "maximum" barcode is "49999@100" -> 052057057057057064049048048 or, encoded a little smarter (49999 is the two bytes 195 and 79, and 100 is the single byte 100), 195079064100. That's a pretty big improvement. Each encoding is guaranteed to be unique because the IDs in the database are unique.
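Both encodings of a word ID plus language ID can be sketched like this. `StringForm`, `CompactForm`, `IdToBytes`, and `BytesToDigits` are hypothetical helpers; the compact form assumes each ID is written as its big-endian bytes with leading zero bytes dropped, three digits per byte, with 064 kept as the separator.

```vbnet
Imports System
Imports System.Text

Module WordLanguageCodeSketch
    ' Render each byte as three decimal digits.
    Function BytesToDigits(bytes As Byte()) As String
        Dim sb As New StringBuilder()
        For Each b As Byte In bytes
            sb.Append(b.ToString("D3"))
        Next
        Return sb.ToString()
    End Function

    ' Big-endian bytes of a non-negative ID, leading zero bytes dropped.
    Function IdToBytes(id As Integer) As Byte()
        Dim raw As Byte() = BitConverter.GetBytes(id)
        If BitConverter.IsLittleEndian Then Array.Reverse(raw)
        Dim start As Integer = 0
        While start < raw.Length - 1 AndAlso raw(start) = 0
            start += 1
        End While
        Dim trimmed(raw.Length - start - 1) As Byte
        Array.Copy(raw, start, trimmed, 0, trimmed.Length)
        Return trimmed
    End Function

    ' "wordId@languageId" as text, then three digits per character byte.
    Function StringForm(wordId As Integer, languageId As Integer) As String
        Dim text As String = wordId.ToString() & "@" & languageId.ToString()
        Return BytesToDigits(Encoding.ASCII.GetBytes(text))
    End Function

    ' Numeric bytes of each ID, with 064 ("@") as the separator.
    Function CompactForm(wordId As Integer, languageId As Integer) As String
        Return BytesToDigits(IdToBytes(wordId)) & "064" & BytesToDigits(IdToBytes(languageId))
    End Function

    Sub Main()
        Console.WriteLine(StringForm(0, 1))        ' 048064049
        Console.WriteLine(StringForm(49999, 100))  ' 052057057057057064049048048
        Console.WriteLine(CompactForm(49999, 100)) ' 195079064100
    End Sub
End Module
```

Note that in the compact form a byte with value 64 can also occur inside an ID, so a decoder cannot simply search for 064. Fixed-width fields (for example, a set number of digits for the language ID at the end) might be a simpler way to keep the two parts apart.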
In this case, when the user wants to add a flash card, it's a simple matter to store the word in the database (and a new language if needed), then generate the barcode for the word.
If my assumption is wrong and the barcode needs to indicate the actual word in the language, you might want a slightly different database layout (but it has no effect on encoding):
Words:
Id | Word | Language Id
0 | dog | 0
1 | house | 0
2 | perro | 1
3 | casa | 1

Languages:
Id | Language
0 | English
1 | Spanish

Translations:
Id | TranslatesTo
0 | 2
1 | 3
Here you store each word and its language in one table, and keep a translation table so you can track which words translate to which other words. For example, words 0 and 2 (dog and perro) are the same word in English and Spanish.
Encoding is a little easier this way and it might be superior to the previous layout; there's no need to encode the language as part of the string because the database already links the word to its language. You could encode "dog" and "perro" as follows:
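To make the lookup concrete, here is a minimal in-memory sketch of that layout, with dictionaries standing in for the database tables. The `WordRow` class and the variable names are assumptions for illustration only; a real application would query the database instead.

```vbnet
Imports System
Imports System.Collections.Generic

Module TranslationLookupSketch
    ' One row of the word table: the text and the ID of its language.
    Class WordRow
        Public Text As String
        Public LanguageId As Integer
        Sub New(wordText As String, langId As Integer)
            Text = wordText
            LanguageId = langId
        End Sub
    End Class

    Sub Main()
        Dim languages As New Dictionary(Of Integer, String)
        languages.Add(0, "English")
        languages.Add(1, "Spanish")

        Dim words As New Dictionary(Of Integer, WordRow)
        words.Add(0, New WordRow("dog", 0))
        words.Add(1, New WordRow("house", 0))
        words.Add(2, New WordRow("perro", 1))
        words.Add(3, New WordRow("casa", 1))

        ' Translation table: word ID -> ID of the word it translates to.
        Dim translations As New Dictionary(Of Integer, Integer)
        translations.Add(0, 2)
        translations.Add(1, 3)

        Dim scannedId As Integer = 0 ' pretend this came from the barcode scanner
        Dim source As WordRow = words(scannedId)
        Dim target As WordRow = words(translations(scannedId))
        Console.WriteLine(source.Text & " (" & languages(source.LanguageId) & ") -> " &
                          target.Text & " (" & languages(target.LanguageId) & ")")
        ' dog (English) -> perro (Spanish)
    End Sub
End Module
```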
string to byte: "0" -> 048, "2" -> 050
id to byte: 000, 002
In the same large case as before, suppose there's 50,000 words and 101 languages; the "last" word would be:
string to byte: "49999" -> 052057057057057
id to byte: 195079
When you read a barcode in this case, your program will look in the database for the word with the appropriate ID, then it can determine what language the word is in and what words are translations in other languages. You know the word ID is unique because the database enforces it.
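The "id to byte" encoding and the matching decode step on the scanning side could be sketched like this. `IdToDigits` and `DigitsToId` are hypothetical names; the sketch assumes IDs fit in an unsigned 32-bit integer and that leading zero bytes are dropped, as in the examples above.

```vbnet
Imports System
Imports System.Text

Module WordIdRoundTripSketch
    ' Big-endian bytes of the ID, three decimal digits per byte,
    ' skipping leading zero bytes (but always emitting at least one byte).
    Function IdToDigits(id As UInteger) As String
        Dim sb As New StringBuilder()
        Dim started As Boolean = False
        For shift As Integer = 24 To 0 Step -8
            Dim b As UInteger = (id >> shift) And &HFFUI
            If b <> 0 OrElse started OrElse shift = 0 Then
                sb.Append(b.ToString("D3"))
                started = True
            End If
        Next
        Return sb.ToString()
    End Function

    ' Rebuild the ID by reading three digits (one byte) at a time.
    Function DigitsToId(digits As String) As UInteger
        Dim id As UInteger = 0
        For i As Integer = 0 To digits.Length - 3 Step 3
            id = (id << 8) Or UInteger.Parse(digits.Substring(i, 3))
        Next
        Return id
    End Function

    Sub Main()
        Console.WriteLine(IdToDigits(2))                   ' 002
        Console.WriteLine(IdToDigits(49999))               ' 195079
        Console.WriteLine(IdToDigits(UInteger.MaxValue))   ' 255255255255
        Console.WriteLine(DigitsToId("195079"))            ' 49999
    End Sub
End Module
```

One thing to check before printing: Interleaved 2 of 5 encodes digits in pairs, so a string with an odd number of digits (such as 002) would need a leading zero, and the decoder would have to account for that padding.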
When the user adds a barcode in this case, they add the word and what language it's in, then pick words that it translates to. This information is entered into the database.
I'm not saying this is the only way, but it does seem like a pretty reliable way to keep the size of the barcodes down. Numbers up to about 4 billion fit in a 32-bit unsigned integer, which is 4 bytes, which would be 12 digits in this representation; so long as you don't plan on supporting more than 4 billion words, I doubt your barcodes would get too long!