Sunday, October 23, 2005

i18n zip file woes

Update: This is being addressed in JDK7. It will still be up to you to figure out what encoding your zip file is in - no small feat if you've got zips coming in from various sources/locales.

Our app allows customers to upload zip files of content. Recently, customers in Brazil were having problems with a file they were uploading. We'd get this exception when trying to unzip it:

java.lang.IllegalArgumentException
at java.util.zip.ZipInputStream.getUTF8String (ZipInputStream.java:291)
at java.util.zip.ZipInputStream.readLOC (ZipInputStream.java:230)
at java.util.zip.ZipInputStream.getNextEntry (ZipInputStream.java:75)

Inspecting the file, I found they were including Portuguese characters in the file name. I was pretty surprised to find this issue was related to not one, but two of Java's Top 25 Bugs. The Sun engineers point out the unfortunate fact that the Zip spec doesn't say anything about encoding of file names. The only thing Java could do better is allow people to pass in their own encoding when instantiating a ZipFile. A commenter even provides a patch for ZipInputStream to allow this. I found this solution didn't even cut it for me. The customers were using some German zip program (a truly international problem!), and I couldn't get a clean unzip with any encoding. When I discovered both WinZip and Windows XP's "compressed folders" feature were baffled as well, I threw in the towel. People waiting for Sun to fix this are probably in for a nasty surprise.
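
For anyone on JDK7 or later, the "pass in your own encoding" idea can be sketched roughly as below. This is only an illustration, not a fix for the German-zip case above: ZipNameProbe and listNames are made-up names, and the candidate list is something you would have to choose yourself. Note that single-byte charsets such as ISO-8859-1 decode any byte sequence without complaint, so in practice this amounts to "try UTF-8 first, then fall back to a guess".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper, not part of the JDK. */
final class ZipNameProbe {

    /**
     * Returns the entry names decoded with the first candidate charset that
     * reads every entry header without error, or null if none of them does.
     */
    static List<String> listNames(byte[] zipBytes, List<Charset> candidates) {
        for (Charset cs : candidates) {
            List<String> names = new ArrayList<>();
            // The (InputStream, Charset) constructor was added in JDK7.
            try (ZipInputStream in =
                     new ZipInputStream(new ByteArrayInputStream(zipBytes), cs)) {
                ZipEntry entry;
                while ((entry = in.getNextEntry()) != null) {
                    names.add(entry.getName());
                }
                return names;
            } catch (IllegalArgumentException | IOException e) {
                // A name could not be decoded with this charset (or the
                // stream is unreadable); fall through to the next candidate.
            }
        }
        return null;
    }
}
```

A "successful" decode only means the bytes were legal in that charset, not that the names are right, so the result still deserves a sanity check before anything is written to disk.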


For the Brazilian customers, I had them re-create the zip using a nice little graphical jar utility - ZipOutputStream uses UTF-8, which will have no trouble with their i18n file names. Another good alternative would be 7-zip, which considered i18n issues from the start. But, surprisingly, its Java support seems pretty lacking.

For the grim details of this problem, I can't say it better than the WinZip tech support person:


>There is unfortunately some ambiguity within Zip files as to the
>character set that the filename is stored in. Whenever a file can be
>stored in the OEM character set (which is the character set originally
>used by MS-DOS), WinZip does so for compatibility with other Zip
>utilities. In this case, it marks the Zip file as made by MS-DOS, since
>the original Zip utilities were DOS utilities that used the OEM
>character set for filenames.
>
>For files whose names can't be stored in OEM, because they contain
>characters not present in the OEM character set, WinZip stores the
>filename in ANSI and marks the file as having been made on an
>NTFS-based system. WinZip is here following the lead of InfoZip, a Zip
>utility that since 1993 used this method of marking filenames as being
>stored in the ANSI character set. (ANSI is the character set used by
>Windows 95, 98, and Me, and by many Windows applications. Windows 2000
>and Windows XP use yet another character set, Unicode, but they also
>provide support for the ANSI character set.)
>
>However, some other Zip utilities handle this differently. In
>particular, Microsoft's Compressed Folders program always marks the
>files it creates as having been made on an NTFS-based system, but
>stores the filename in the OEM character set.
>
>So when WinZip sees a file marked as NTFS-based, it has no way to be
>absolutely sure whether the filename is in the OEM character set (and
>needs to be translated to ANSI), or if it is already in the ANSI
>character set. WinZip uses some complicated heuristics to try to decide
>which character set is involved, and it almost always comes up with the
>"right" answer, but there is no way to be absolutely right in every
>case.
>
>We are looking into how to handle this better in a future version of
>WinZip.



Saturday, October 22, 2005

Crystal ate my log4j!

When solving a problem on a live site, logs are indispensable. So imagine my frustration last spring when I went to solve a problem, and no logs were there. None. Everything had just stopped logging. Solving a problem with access logs alone is not an enviable task. I searched around a bit, and found this

I was indeed using the log4j WatchDog to periodically watch log4j.xml for changes. And we do hot restarts of the context from time to time. I wasn't convinced this was the problem, as I could not reproduce the problem no matter what I did with the WatchDog and hot restarts. So I set the system property "log4j.debug" to "true", in the hopes that next time it happened, I'd get some kind of info. I also wrote code that ensured any time a Logger object is acquired, a System.out Console appender was attached to it, at the very least.
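
The "guarantee a console appender on acquisition" idea looks roughly like this. To keep the sketch runnable without extra jars it uses java.util.logging rather than log4j, so it is an analogy and not the code described above; GuardedLoggers is a made-up name. In log4j 1.x the corresponding pieces would be Logger.getAllAppenders() and ConsoleAppender.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Logger;

/** Hypothetical wrapper: all application code asks this class for loggers. */
final class GuardedLoggers {

    /** True if a record logged here would reach at least one handler. */
    static boolean hasReachableHandler(Logger logger) {
        for (Logger l = logger; l != null; l = l.getParent()) {
            if (l.getHandlers().length > 0) {
                return true;
            }
            if (!l.getUseParentHandlers()) {
                return false; // chain is cut here; ancestors never see records
            }
        }
        return false;
    }

    /** Single acquisition point; attaches a console handler as a last resort. */
    static Logger get(String name) {
        Logger logger = Logger.getLogger(name);
        if (!hasReachableHandler(logger)) {
            // java.util.logging's ConsoleHandler writes to System.err.
            logger.addHandler(new ConsoleHandler());
            // Escalate: someone should find out the configuration was lost.
            System.err.println("Logging configuration lost for " + name);
        }
        return logger;
    }
}
```

The check walks up the parent chain because a logger with no handlers of its own is normally fine as long as an ancestor has one.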

The logs continued to disappear, but thankfully output went to stdout, and was captured. Then, log4j.debug paid off. In the middle of a log, I saw this:
log4j: Reading configuration from URL jar:file:/E:/webapps/app/WEB-INF/lib/celib.jar!/META-INF/CrystalEnterprise.Trace/basic.properties
log4j: Parsing for [root] with value=[ERROR, A1].
log4j: Level token is [ERROR].
log4j: Category root set to ERROR
Whoa! Sure enough, all of my categories and appenders were blown away at this point. It didn't happen at startup, but whenever the first Crystal report was launched. After much consternation and trying to decipher the log4j comment in the release notes, my coworker came up with setting the system property "crystal.enterprise.trace" (yes, not even mentioned in the release notes) to "false". Now we can launch Crystal reports AND use log4j - oh, the decadence!

I'm no log4j expert, so I wonder how Crystal should have played more nicely here. Surely, there's a way to use PropertyConfigurator to amend logging configuration, rather than obliterate it.

Lessons learned:
1. Abstract your code such that you can intercept acquisition of Logger objects. Use this code to ensure that a Logger is at least guaranteed to have a Console appender, and escalate the problem if it indeed has zero appenders.
2. Set log4j.debug to true at all times.
3. When using Crystal, be sure to set crystal.enterprise.trace to false (unless, of course, you care about their logs more than your own).
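
The two system properties can be passed as -D flags on the JVM command line, or set in code as in this small sketch. LoggingBootstrap is a made-up name, and the values are the ones that worked in the story above: "true" for log4j.debug and "false" for crystal.enterprise.trace. It only helps if it runs before any log4j or Crystal class is loaded, which in a servlet container usually makes the -D route the safer one.

```java
/** Hypothetical bootstrap helper; call it before anything touches log4j. */
final class LoggingBootstrap {

    static void apply() {
        // log4j 1.x prints its own configuration steps when this is set.
        System.setProperty("log4j.debug", "true");
        // Per the account above, this keeps Crystal from re-configuring log4j.
        System.setProperty("crystal.enterprise.trace", "false");
    }
}
```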





Microsoft to buy Redhat?

I've finally gotten over my fear that I'd create a totally useless, misguided, and confused blog. I've gotten over it - I now know this blog will be useless, misguided, and confused. My only three goals are:
  • Document weird problems I've encountered in the hopes that I can help someone avoid my frustration
  • Have inflammatory titles to attract people to read my blog, and respond to it.
  • Learn from responses to my posts
  • Vent.
I'm a Java programmer who has been working on roughly the same product since about 1998. That being said, I won't have many smart things to say about the hot topics of the day: AJAX, RoR, or Web 2.0. I will hopefully be able to share interesting experience on more mundane topics, like JDBC, i18n, webapp performance, maintainable code, and other such things.