学術情報処理研究 (Academic Information Processing Research) No.1 1997 pp.25-34
Hiraiso Solar Terrestrial Research Center, Communications Research Laboratory,
3601 Isozaki, Hitachinaka, Ibaraki 311-12
TEL: 029-265-9729 FAX: 029-265-9728
*Information Processing Center, Chiba University,
1-33 Yayoi, Inage, Chiba 263
TEL: 043-290-3536 FAX: 043-290-3544
**Department of Physics, Faculty of Science, Chiba University,
1-33 Yayoi, Inage, Chiba 263
TEL: 043-290-3724 FAX: 043-290-3720
Data compression is a powerful way to resolve the problems mentioned above, since it reduces the data volume: users can write their dynamical simulation results to files more frequently, and can also save the CPU time spent on output. To overcome this dilemma, we propose a new compression algorithm.
Many data compression methods have been proposed. They can be divided into two groups: 'lossless' and 'lossy' methods. Lossless means that the compressed data can be restored perfectly; an example is gzip, which runs mainly on unix machines. Lossless methods are appropriate for sequential, not-so-large text data. Lossy compression, on the other hand, can achieve drastic compression precisely because it discards much of the information in the original data, and it is well suited to image data. Lossy methods carry a common risk: failure of compression. By 'failure' or 'success' we mean the following. If an image drawn from the compressed data bears comparison with an image drawn from the original data, the compression is successful; if there is an apparent difference between the two images, it has failed. A typical lossy method is JPEG; its result depends on the data, so compression is not always successful.
It should be noted that compression can also fail with a lossless method, although 'failure' then has a different meaning from the word as used for lossy methods: the volume of the compressed file is comparable with, or even larger than, that of the original file. Compression is thus very useful in general, but its success depends on the original data.
In this paper, we introduce a new data compression algorithm. The data we target are image-like data, mainly used for visualization, whose volume is huge. Since such data require drastic compression, our algorithm can be lossy rather than lossless. On the other hand, numerical data obtained from simulations can be a sequential stream of characters (text), like the data usually compressed by lossless methods. Our compression method therefore sits between the two groups above.
In the next section, we describe the compression algorithm, show the compression efficiency obtained for astrophysical models, and compare contour maps of the time evolution in a numerical simulation of galaxy formation drawn with the data compressed by our algorithm and with the original data. A summary and future work are given in Section 3.
The key point is how to extract the minimal necessary information from the original data. Our compression algorithm is as follows.
If the data distribute linearly with one peak, the following simplest process can be applied: set equally spaced contour levels between the minimum and the maximum of the data, classify each datum into its nearest level, and store the one-byte level index of each datum together with a table of the level values.
This method is also applicable to more complicated cases, such as a density distribution with two peaks, or to vector fields such as velocities or magnetic fields; we return to this point in the final section. As for the number of levels, more than 256 can be used if necessary. The file size of the table is given by (4 bytes × number of contour levels + file header) ≈ 1 KB. If the original data are expressed in single precision, their volume is compressed to about a quarter by this part of our algorithm.
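The first-stage classification described above can be sketched as follows. This is our own illustrative reconstruction in C, not the authors' FORTRAN routine: it replaces each single-precision value with a one-byte index into a table of equally spaced levels between the data minimum and maximum, which gives the quarter-volume figure quoted above.

```c
#include <stdio.h>

#define NLEVELS 256  /* up to 256 levels fit in one byte */

/* Quantize n single-precision values into NLEVELS equally spaced
 * contour levels between the data minimum and maximum.
 * out[i] is the one-byte level index of in[i]; table[k] is the
 * value represented by level k (the table written to the file). */
void quantize(const float *in, unsigned char *out, float *table, int n)
{
    float lo = in[0], hi = in[0];
    int i, k;
    for (i = 1; i < n; i++) {
        if (in[i] < lo) lo = in[i];
        if (in[i] > hi) hi = in[i];
    }
    float step = (hi - lo) / (NLEVELS - 1);
    for (k = 0; k < NLEVELS; k++)
        table[k] = lo + k * step;
    for (i = 0; i < n; i++) {
        k = (step > 0.0f) ? (int)((in[i] - lo) / step + 0.5f) : 0;
        if (k >= NLEVELS) k = NLEVELS - 1;
        out[i] = (unsigned char)k;
    }
}
```

Restoration is simply table[out[i]], which is where the method is lossy: each datum comes back as its nearest contour level rather than its exact value.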
Since the compressed data are still sequential text (i.e., a stream of characters), we can compress them further with a lossless compression tool; in our algorithm the data thus go through compression twice. As the lossless tool in the second stage, we adopt LZW15V.C in this paper. This tool is based on the LZW algorithm, the same as that of the 'compress' command familiar to unix users (see reference). The first-stage compression program based on our algorithm is written in FORTRAN, and the second-stage program is the public C source, so the data can be compressed from within the application program, which simply calls these routines. We release the subroutines to the public, so that anyone who wants them can use them (please contact the first author). Note that any lossless compression tool can be used for the second stage.
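The two-stage flow can be illustrated as below. This is a sketch of our own, not the paper's FORTRAN/C code: the unix 'compress' command (or any other lossless tool, as the text notes) stands in for LZW15V.C, the tool is passed in as a command string, and the file names are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/* Write the one-byte level indices (first stage) to a file, then
 * apply an external lossless tool as the second stage.  'tool' is
 * a command such as "compress -f" (producing path.Z); it stands in
 * for LZW15V.C here.  Returns 0 on success, nonzero otherwise. */
int write_and_compress(const unsigned char *map, long n,
                       const char *path, const char *tool)
{
    FILE *fp = fopen(path, "wb");
    if (!fp)
        return -1;
    fwrite(map, 1, (size_t)n, fp);
    fclose(fp);

    char cmd[256];
    snprintf(cmd, sizeof cmd, "%s %s", tool, path);
    return system(cmd);
}
```

A call such as write_and_compress(map, n, "dens0001.map", "compress -f") (illustrative names) performs both stages; swapping the command swaps the second-stage tool.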
The compression efficiency (CE) is defined by

CE = (r - m) / r = 1 - m / r,

where m is the volume of the compressed data, called the map data, and r is the volume of the original data, called the raw data. By this definition, compression is more efficient as CE approaches 1; a negative CE, which is possible in general, means the compression has failed. Since with our algorithm the size of the map data is always a quarter of that of the original data when the raw data are expressed in single precision, the compression efficiency of this first stage is always 75%.
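As a concrete check of this definition (a sketch of our own):

```c
/* Compression efficiency CE = (r - m) / r, where r is the raw data
 * volume and m the compressed (map) data volume, as defined above.
 * CE approaches 1 for efficient compression and is negative when the
 * "compressed" file is larger than the original. */
double compression_efficiency(double m, double r)
{
    return (r - m) / r;
}
```

For single-precision raw data, the first stage gives m = r/4 and hence CE = 0.75, matching the 75% stated above.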
First, we obtain the CE for snapshot data of the density with one peak for three models, which are
Fig.1 Contour maps of the density drawn with the raw data and with the map data
for Model I. Two-dimensional plot on z=constant surface.
Fig.2 Same maps as figure 1 but for Model II.
Fig.3 Same maps as figure 1 but for Model III.
The contour maps of the density drawn with the raw data and with the map data for Models I, II and III are shown in figures 1, 2 and 3, respectively. There are some differences between the two pictures in the low-density region in figure 3; however, the main structure, i.e. the high-density region, is reproduced by the map data in this model. Note that the interesting physical process does not always proceed in the high-density region when the system includes a magnetic field; in that case an analysis using only density contour maps is not enough, and maps of the magnetic field and/or velocities are also needed. We will investigate this in the next paper. At least we can say that the compressed map is comparable with the original map in Model III, and that the compression can be regarded as successful for all three models.
In table 1, we show the volume of the raw data (R), the volume of the map data (M), and the volume of the twice-compressed map data (C2). The compression efficiency is larger than 93% for each model, so the original data are reduced to less than 7% of their volume, and to 1.9% for Model II, which shows the most complicated plot and has the largest data volume.
This is very important, because under the same disk-space constraint we can store data for ten times as many time steps (about fifty times as many for Model II) for the analysis of time evolution. The time evolution of the phenomena of interest can then be traced in detail, so it becomes possible to find physics that might be missed in coarse snapshots. CPU time is also limited in most supercomputer systems: if the CPU time for I/O is not negligible within one job class, reducing the volume of output data lets us devote more CPU time to computation for the same number of snapshots. To investigate this point, we followed the time evolution of galaxy formation by 3D numerical simulation. In this calculation we inserted the compression subroutines and wrote the compressed data, as well as the raw data for comparison, to files, saving 50 map-data snapshots and 4 raw-data snapshots.
The CPU times are given in table 2. The ratio of the CPU time spent on compression and data output to the total CPU time is about 1.7%, which we can regard as negligible. The CPU time for one execution of the first compression is 0.67 s; that for one execution of the second, lossless compression together with I/O of the map data is 4.33 s; and that for one output of the raw data is 3.25 s. The CPU time for I/O of the map data alone cannot be compared directly with that of the raw data, because it is difficult to separate the pure I/O time out of the 4.33 s. However, since the volume of the map data is always less than a quarter of that of the raw data, we expect t_io(map) < t_io(raw). The CPU time for compression and its I/O is therefore negligible.
The file sizes of the raw data and the compression efficiencies of map data selected from the 50 snapshots are shown in table 3. The file size of the raw data is the same for all four snapshots. The CE is nearly 75% at the initial time, which means the second, lossless compression achieves almost nothing there: the CE of the second stage seems related to the simplicity of the configuration of the data distribution, and there is no simple structure initially in this simulation model. This model is the same as Model II, but the configuration of the density distribution and the CE differ from those of Model II (see figure 2 and figures 4a, 4b below), because here we take the logarithm of the data before classifying them into contour levels. Since the degeneracy of the density contours is relaxed, more complicated structure appears and the CE becomes low. We take 251 contour levels, so the file size of the table is negligible compared with the total file size of the map data.
Fig.4 Density maps in shades of a monochromatic color on the z=constant surface. The raw data (4a-A 〜 4a-Z) and selected map data (4b-1 〜 4b-50) are arranged in time sequence.
We present the density maps in shades of a monochromatic color in figures 4a and 4b, which show the time evolution traced by the raw data (4a) and by selected map data (4b). Figures 4a-A and 4b-1 are snapshots at the same time, as are figures 4a-C and 4b-50 (there are no figures in 4b corresponding to 4a-B and 4a-Z; figure 4a-Z shows a later stage than figures 4a-C and 4b-50). These plots show the process of galaxy formation. At the initial time (figures 4a-A and 4b-1) there is little depth in the shading, since the gas is distributed without any structure. As time evolves, the gas gathers into high-density regions (drawn in black) owing to gravitational instability, and large-scale structure appears, as seen in the final maps (figures 4a-Z and 4b-50). We can see little difference within these pairs, so again the compression is successful for this model. The CE grows almost monotonically with time, as shown in figure 5, because the configuration is simplified after the formation of the large-scale structure. This suggests that the CE may serve as an index of the complexity of a configuration, like entropy in hydrodynamics or the genus count in geometry.
Fig.5 Time evolution of the compression efficiency. The horizontal axis is the scale factor of the universe, which plays the role of time because it grows monotonically.
Furthermore, since the volume of the map data is small, handling the data is fast and light. For example, the maps in figures 4a and 4b are drawn on the two-dimensional surface z=constant; with the map data we can easily view maps on other surfaces. We conclude that the map data are very useful for analyzing the numerical data.
Since the data compressed by our method are still sequential text, we can compress them once more by a lossless method; we adopt LZW15V.C, whose source code is public. We wrote the subroutines implementing our compression method in FORTRAN so that anyone can use them together with LZW15V.C, which is written in C.
We have shown the practical applicability of this method on three typical models. The final CE is more than 93% for all three. These CEs depend on the second, lossless compression tool and may depend on the configuration of the data distribution. To investigate this point, we simulated the time evolution of galaxy formation numerically and wrote the map data at 50 times and the raw data at 4 times to files. The CPU time needed for compression is negligible compared with the total CPU time, and the CE grows as the configuration becomes simpler. We therefore suggest that the CE may be regarded as a kind of index of simplicity or complexity.
We adopted the simplest setting for the contour levels and applied it only to the density variable in this paper. Some refinement is needed, however, e.g. for distributions with two peaks or for vector variables. The contour levels need not be set at equal intervals; it would be better to set them automatically, with a constant degree of information loss, following changes in the minimum, maximum and distribution of the data. In the next paper, we will present the improved compression tool and apply it to vector variables.
This method can also be used for volume data in AVS, a graphics package that many people use for multi-dimensional visualization. Visualization is very important for the analysis of multi-dimensional numerical simulation data, so in the future we will present I/O modules for AVS and/or IDL (another graphics package), which should be welcomed by researchers working with numerical simulations.
(Received 25 July 1997 ; accepted 15 August 1997)