Tuesday, September 18, 2007

Pseudo Language


I'm currently converting a substantial application to be internationalized and localized. Internationalization basically means we're completely abstracting the language from the application, so we can easily have French, German, Italian, and other languages displayed in our application.

In addition to internationalization, we also do localization, which means the way we format dates, times, currency, and other "things" is specific to a given locale. For example, there are French-speaking people in both Canada and France, but the way they represent numbers, dates, times, and currency can differ quite a bit, so it's important to have formatting specific to each locale.

To help us with this, we have a bunch of cool, nifty tools. One of the tools we have is some code that generates a pseudo language. The pseudo language is basically a fake language that reads like English, but looks obviously different. The pseudo strings end up looking sort of like the original strings, but with heavily modified characters, and additional padding on the beginning and end of the strings. For example, the string:

The quick brown fox jumps over the lazy dog.

...becomes:

[!!! Ťħĕ qǔĭčĸ þřŏŵʼn ƒŏχ ĵǔmpš ŏvĕř ťħĕ ľäžЎ đŏğ. !!!]

The idea behind the additional length (which, by the way, varies with the length of the original string) is to ensure there's enough room on the forms in case the translation ends up being substantially longer than the original. The idea behind the modified characters is to have some obvious way of telling that our application isn't loading the English strings (which, for reasons that shall remain unexplained, can be sort of a pain in the butt on .NET). It's also nice because it gives a visible indicator of places that aren't being translated: anything that still shows up in plain English is a string we missed.
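
Here's a minimal sketch of that kind of pseudo translator, written in Python rather than the .NET code we actually use. The substitutions are read off the sample output above. The padding rule (one "!" for every 20 characters, plus one) and the names PSEUDO_CHARS, pad_marks, and pseudo_translate are assumptions for illustration, not the real tool's.

```python
# Substitutions taken from the sample output in this post; a real table
# would cover the whole alphabet in both cases.
PSEUDO_CHARS = {
    "a": "ä", "b": "þ", "c": "č", "d": "đ", "e": "ĕ", "f": "ƒ",
    "g": "ğ", "h": "ħ", "i": "ĭ", "j": "ĵ", "k": "ĸ", "l": "ľ",
    "n": "ʼn", "o": "ŏ", "r": "ř", "s": "š", "t": "ť", "u": "ǔ",
    "w": "ŵ", "x": "χ", "y": "Ў", "z": "ž",
    "D": "Đ", "S": "Š", "T": "Ť",
}


def pad_marks(text):
    # Assumed rule: longer strings get more padding, with at least one "!".
    return "!" * (len(text) // 20 + 1)


def pseudo_translate(text):
    # Swap every character that has a substitute; leave the rest alone.
    body = "".join(PSEUDO_CHARS.get(ch, ch) for ch in text)
    marks = pad_marks(text)
    return f"[{marks} {body} {marks}]"


print(pseudo_translate("The quick brown fox jumps over the lazy dog."))
```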

One issue we quickly found, however, was that our pseudo translator was rather stupid. The following string, which contains .NET String.Format arguments:

This is a string with String.Format parameters: {0:D}

...would end up looking something like:

[!!! Ťħĭš ĭš ä šťřĭʼnğ ŵĭťħ Šťřĭʼnğ.Fŏřmäť päřämĕťĕřš: {0:Đ} !!!]

The issue with this, of course, is that {0:Đ} doesn't mean anything to String.Format; only {0:D} does. So, we had to modify the code to be aware of {} blocks and not attempt to translate their contents. In theory, this sounded straightforward, but it ended up being a bit involved. One thing is that "{{" and "}}" are how String.Format escapes the "{" and "}" characters, so a perfectly valid String.Format string is:

This {{{0}}} is a {{ totally valid {{ String.Format statement.

...or maybe:

This is yet another perfectly valid String.Format argument: {{{0:D}{{Hello World!}}

In the first, String.Format produces "This {0} is a { totally valid { String.Format statement." (the "0" stands in for whatever argument gets passed, but the point is that it's going to be wrapped in {} in the output). The second ends up looking like "This is yet another perfectly valid String.Format argument: {0:D{Hello World!}", where "0:D" again stands in for the formatted argument. So a single { and its matching } mark the place for a format argument, and any doubled {{ or }} comes out as a literal brace in the resulting string.
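
As a quick sanity check, Python's str.format happens to use the same brace-doubling escape, so it can stand in for String.Format here. This is only an analogy, not .NET itself: Python spells the integer format "d" rather than "D", and the argument values "X" and 42 are made up for the example.

```python
# Doubled braces are literal; single braces mark a format argument.
first = "This {{{0}}} is a {{ totally valid {{ String.Format statement."
second = "This is yet another perfectly valid String.Format argument: {{{0:d}{{Hello World!}}"

print(first.format("X"))
# This {X} is a { totally valid { String.Format statement.

print(second.format(42))
# This is yet another perfectly valid String.Format argument: {42{Hello World!}
```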

The solution I used kept track of the previous character. If I encountered {{ or }}, I reset the previous character, so an escaped pair couldn't be mistaken for the start or end of a format argument. Any single { turned translation off until the matching } was encountered.
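
Roughly, that approach looks like the sketch below. It's in Python rather than the actual .NET code, the substitution table is cut down to a handful of characters, and the padding is left out to keep it short. The names are hypothetical. One simplification to note: a "}" outside a format argument is just copied through, since only "{" can switch translation off.

```python
# Reduced, illustrative substitution table (the real one is larger).
PSEUDO_CHARS = {"a": "ä", "e": "ĕ", "i": "ĭ", "o": "ŏ",
                "s": "š", "t": "ť", "D": "Đ", "S": "Š"}


def pseudo_translate_braces(text):
    out = []
    prev = None              # set to "{" while we wait to see what follows it
    in_placeholder = False   # True between a single "{" and its closing "}"
    for ch in text:
        if in_placeholder:
            # Copy the format argument untouched; "}" ends it.
            out.append(ch)
            in_placeholder = ch != "}"
            continue
        if prev == "{":
            # Reset the previous character either way.
            prev = None
            out.append(ch)
            # "{{" is an escaped brace; anything else means the earlier "{"
            # opened a format argument (an empty "{}" closes immediately).
            in_placeholder = ch not in "{}"
            continue
        if ch == "{":
            prev = ch
        # Braces are not in the table, so they pass through unchanged.
        out.append(PSEUDO_CHARS.get(ch, ch))
    return "".join(out)


print(pseudo_translate_braces("This is a string with String.Format parameters: {0:D}"))
print(pseudo_translate_braces("This {{{0}}} is a {{ totally valid {{ String.Format statement."))
```

The {0:D} and {0} blocks come through untouched, while text sitting between escaped braces still gets translated.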
