Sunday, October 23, 2005

i18n zip file woes

Update: This is being addressed in JDK7. It will still be up to you to figure out what encoding your zip file is in - no small feat if you've got zips coming in from various sources/locales.

Our app allows customers to upload zip files of content. Recently, customers in Brazil were having problems with a file they were uploading. We'd get this exception when trying to unzip it:

java.lang.IllegalArgumentException
at java.util.zip.ZipInputStream.getUTF8String (ZipInputStream.java:291)
at java.util.zip.ZipInputStream.readLOC (ZipInputStream.java:230)
at java.util.zip.ZipInputStream.getNextEntry (ZipInputStream.java:75)

Inspecting the file, I found they were including Portuguese characters in the file name. I was pretty surprised to find this issue was related to not one, but two of Java's Top 25 Bugs. The Sun engineers point out the unfortunate fact that the Zip spec doesn't say anything about encoding of file names. The only thing Java could do better is allow people to pass in their own encoding when instantiating a ZipFile. A commenter even provides a patch for ZipInputStream to allow this. I found this solution didn't even cut it for me. The customers were using some German zip program (a truly international problem!), and I couldn't get a clean unzip with any encoding. When I discovered that both WinZip and the Windows XP "compressed folders" feature were baffled as well, I threw in the towel. People waiting for Sun to fix this are probably in for a nasty surprise.
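
For readers on JDK 7 or later, passing in your own encoding looks roughly like the sketch below. It uses the ZipInputStream(InputStream, Charset) constructor and assumes a simple strategy: try UTF-8 first, then re-read with a fallback charset if a name fails to decode. The class and method names are made up for illustration, and ISO-8859-1 only stands in for whatever code page the customer's zip tool really used (a DOS code page such as IBM850 would be obtained with Charset.forName and needs the JDK's extended charsets).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipNameCharsetSketch {

    // Reads all entry names, decoding them with the given charset (JDK 7+ constructor).
    static List<String> listNames(byte[] zip, Charset charset) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip), charset)) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // Tries UTF-8 first; if a name is not valid UTF-8, re-reads with the fallback charset.
    static List<String> listNamesWithFallback(byte[] zip, Charset fallback) throws IOException {
        try {
            return listNames(zip, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException | ZipException badName) {
            return listNames(zip, fallback);
        }
    }

    // Demo helper: builds a one-entry zip whose file name is encoded in the given charset.
    static byte[] zipWithName(String name, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer, charset)) {
            out.putNextEntry(new ZipEntry(name));
            out.write(1);
            out.closeEntry();
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Unicode escapes keep the source file itself free of encoding problems.
        String name = "cora\u00E7\u00E3o.txt";
        byte[] legacyZip = zipWithName(name, StandardCharsets.ISO_8859_1);
        System.out.println(listNamesWithFallback(legacyZip, StandardCharsets.ISO_8859_1));
    }
}
```

Keep in mind that a single-byte fallback such as ISO-8859-1 will decode any byte sequence without complaint, so a successful read does not prove the names came out right. You still have to know, or guess, the real encoding.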


For the Brazilian customers, I had them re-create the zip using a nice little graphical jar utility - ZipOutputStream uses UTF-8, which will have no trouble with their i18n file names. Another good alternative would be 7-zip, which considered i18n issues from the start. But, surprisingly, their Java support seems pretty lacking.
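
To see why a zip written from Java reads back cleanly in Java, here is a minimal round-trip sketch. It relies only on the default behavior of ZipOutputStream and ZipInputStream, which both use UTF-8 for entry names; the class name, method names and the sample file name are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class Utf8ZipRoundTrip {

    // Writes a one-entry zip; ZipOutputStream encodes the entry name as UTF-8 by default.
    static byte[] writeZip(String entryName, byte[] content) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.write(content);
            out.closeEntry();
        }
        return buffer.toByteArray();
    }

    // Reads the name of the first entry; ZipInputStream decodes names as UTF-8 by default.
    static String firstEntryName(byte[] zip) throws IOException {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry entry = in.getNextEntry();
            return entry == null ? null : entry.getName();
        }
    }

    public static void main(String[] args) throws Exception {
        // "informações.txt", written with Unicode escapes to keep the source ASCII-only.
        String name = "informa\u00E7\u00F5es.txt";
        byte[] zip = writeZip(name, new byte[] {1, 2, 3});
        System.out.println(name.equals(firstEntryName(zip)));
    }
}
```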

For the grim details of this problem, I can't say it better than the WinZip tech support person:


>There is unfortunately some ambiguity within Zip files as to the
>character set that the filename is stored in. Whenever a file can be
>stored in the OEM character set (which is the character set originally
>used by MS-DOS), WinZip does so for compatibility with other Zip
>utilities. In this case, it marks the Zip file as made by MS-DOS, since
>the original Zip utilities were DOS utilities that used the OEM
>character set for filenames.
>
>For files whose names can't be stored in OEM, because they contain
>characters not present in the OEM character set, WinZip stores the
>filename in ANSI and marks the file as having been made on an
>NTFS-based system. WinZip is here following the lead of InfoZip, a Zip
>utility that since 1993 used this method of marking filenames as being
>stored in the ANSI character set. (ANSI is the character set used by
>Windows 95, 98, and Me, and by many Windows applications. Windows 2000
>and Windows XP use yet another character set, Unicode, but they also
>provide support for the ANSI character set.)
>
>However, some other Zip utilities handle this differently. In
>particular, Microsoft's Compressed Folders program always marks the
>files it creates as having been made on an NTFS-based system, but
>stores the filename in the OEM character set.
>
>So when WinZip sees a file marked as NTFS-based, it has no way to be
>absolutely sure whether the filename is in the OEM character set (and
>needs to be translated to ANSI), or if it is already in the ANSI
>character set. WinZip uses some complicated heuristics to try to decide
>which character set is involved, and it almost always comes up with the
>"right" answer, but there is no way to be absolutely right in every
>case.
>
>We are looking into how to handle this better in a future version of
>WinZip.
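
The ambiguity described above is easy to reproduce. The sketch below stores a Portuguese file name as OEM bytes and then reads the same bytes back as ANSI. It assumes code page 850 (IBM850) for OEM and windows-1252 for ANSI, which are typical for Western European Windows installs but are my assumption, not something the WinZip reply states. IBM850 also requires a JDK that ships the extended charsets. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;

public class OemVersusAnsi {

    // Encodes text with one charset, then decodes the resulting bytes with another.
    static String reinterpret(String text, Charset storedAs, Charset readAs) {
        return new String(text.getBytes(storedAs), readAs);
    }

    public static void main(String[] args) {
        Charset oem = Charset.forName("IBM850");        // assumed OEM code page
        Charset ansi = Charset.forName("windows-1252"); // assumed ANSI code page
        String name = "cora\u00E7\u00E3o.txt";          // "coração.txt"

        // Same bytes, two readings: the first is intact, the second is garbled.
        System.out.println(reinterpret(name, oem, oem));
        System.out.println(reinterpret(name, oem, ansi));
    }
}
```

Both decodes succeed without any error, because every byte is a valid character in either code page. That is why a tool has nothing but heuristics to go on when the archive does not say which one was used.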


