Sunday, September 10, 2006

Tons of POS Transaction Files

For a retail company's data mining proof of concept (POC), again, I have been buried under 90k small files containing one year of POS transaction data, nearly double the amount of files/data of the previous POC effort. Almost 70% of these files are compressed in Z format. I made a quick study of the java.util.zip package provided in Java SDK 1.4.2 (the package has been available since SDK 1.1). No luck: the standard facility only supports the ZIP and GZIP formats. OK, fine. I searched SourceForge for an open source Java implementation, but none of the results were directly useful for my decompression need. Then I tried WinZip, WinRAR, PowerZip, etc. Hmm, most of the Windows GUI versions of these programs can decompress a Z archive, but none of them provides batch processing. Darn, I am not going to decompress 90k files one by one. Do I look that dumb? OK, what about the command line versions of WinZip and WinRAR? Oops, unfortunately they don't support the Z format in their command line versions.

So I decided to do some research on the Z archive format and found this useful article:

Uncompress gz and Z format

It seems to me that the Z format and many other compression formats are natively supported by UNIX systems. So sad for Windows users.

Also, read this
Wikipedia: List of archive formats

An uppercase .Z file is a different format from a lowercase .z file. Generally, .Z is produced by UNIX's compress command, whereas .z is produced by UNIX's pack command. The compression algorithms differ too: compress uses LZW, while pack uses Huffman coding.
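Since the file extension alone is easy to get wrong (especially on Windows, where case is often lost), here is a minimal sketch of telling the formats apart by their first two bytes instead. The class and method names are mine, purely for illustration, and the magic numbers (1F 9D for compress, 1F 1E for pack, 1F 8B for gzip) are the commonly documented ones; I have not checked them against every file in this POC.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: guesses the compression format from the two leading bytes.
public class CompressedFormatSniffer {

    // b0 and b1 are the first two bytes of the file as unsigned values (or -1 at end of file).
    static String detect(int b0, int b1) {
        if (b0 != 0x1F) {
            return "unknown";
        }
        switch (b1) {
            case 0x9D: return "compress (.Z)";
            case 0x1E: return "pack (.z)";
            case 0x8B: return "gzip (.gz)";
            default:   return "unknown";
        }
    }

    // Reads the first two bytes of the given file and classifies them.
    static String detect(File file) throws IOException {
        InputStream in = new FileInputStream(file);
        try {
            int b0 = in.read();
            int b1 = in.read();
            return detect(b0, b1);
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java CompressedFormatSniffer file1 file2 ...
        for (int i = 0; i < args.length; i++) {
            System.out.println(args[i] + ": " + detect(new File(args[i])));
        }
    }
}
```

Running this over a sample of the files before the batch job would show whether everything named .Z really is compress output.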

Since I only had limited time for this decompression task, I finally settled on the GUNZIP program, which is freely available (http://www.gzip.org/), and performed a batch decompression with it. Then on to the ETL phase.

And here is a forum post that I found describing a similar decompression requirement. Most probably I will use Runtime.exec to call an external utility such as GUNZIP, rather than trying to find a Java implementation to integrate. Anyway, it depends on how much time I have.

Similar Issue
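For the record, calling out to GUNZIP from Java could look roughly like the sketch below. It is only a sketch under a few assumptions: a gunzip executable is on the PATH, all the .Z files sit in one directory, and a newer JDK is available, because I use ProcessBuilder (the later alternative to calling Runtime.exec directly, not present in SDK 1.4.2). The class name GunzipBatch and its methods are made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical batch driver: runs the external gunzip program once per .Z file.
public class GunzipBatch {

    // Builds the command line for one file; assumes "gunzip" is on the PATH.
    static List<String> commandFor(File file) {
        return Arrays.asList("gunzip", file.getPath());
    }

    // Collects regular files in dir whose names end in uppercase ".Z".
    static List<File> findZFiles(File dir) {
        List<File> result = new ArrayList<File>();
        File[] children = dir.listFiles();
        if (children == null) {
            return result; // not a directory, or it could not be read
        }
        for (File f : children) {
            if (f.isFile() && f.getName().endsWith(".Z")) {
                result.add(f);
            }
        }
        return result;
    }

    // Runs gunzip on one file and returns its exit code.
    static int decompress(File file) throws IOException, InterruptedException {
        // inheritIO() passes gunzip's output to our console so the child cannot block on a full pipe.
        Process p = new ProcessBuilder(commandFor(file)).inheritIO().start();
        return p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (File f : findZFiles(dir)) {
            int exit = decompress(f);
            if (exit != 0) {
                System.err.println("gunzip failed (" + exit + "): " + f);
            }
        }
    }
}
```

Starting one process per file is slow for 90k files, but it keeps failures isolated: one bad archive only produces one error line instead of stopping the whole batch.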
