In search of a (decent) analyst programmer; project: The Unlimited Compressor

le Redoutable

I am El Glouglou :)
the goal is to create, from a homogeneous WinZip-type file, a heterogeneous file which will then be eligible for a new WinZip pass, with a gain even as low as 1-2%;

what is a homogeneous file?
it is a file where all byte values are represented with a ratio of, say, 1/180 to 1/300 (the ideal would be 1/256 if occurrences of each value were perfectly even);
what is a heterogeneous file?
it is a file where some values appear more often than others;
for example, text files are essentially composed of occurrences of values from 32 to 127 (or so)
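
to make the distinction measurable, here's a quick Python sketch (just a prototype idea, and "archive.zip" is only a placeholder name) that counts byte values and compares the most frequent one's share to the ideal 1/256:

from collections import Counter

def homogeneity_report(path):
    data = open(path, "rb").read()
    counts = Counter(data)
    top_value, top_count = counts.most_common(1)[0]
    print("file size      :", len(data), "bytes")
    print("distinct values:", len(counts), "of 256")
    print("most frequent  :", top_value, "-> ratio of 1 /", len(data) // top_count)

homogeneity_report("archive.zip")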

so, here's my method:

first, some statistics:
find the value with the most occurrences;
as I said above, the most frequent value will give a ratio of (for example) 1/180;
that means you can store the offset between consecutive occurrences of that value within a single byte (because statistics say offsets should average around 180, so 255 (the max value you can write within a byte) should rarely be exceeded);
still, sometimes you may end up with an offset of 280, 400, or even 850, so you can easily rule that an offset byte of 255 means 254 plus another byte of 0 to 254, which again, if equal to 255, means you have 254 + 254 + another byte, and so on;
the only problem with appending too many of these extra bytes is that they add to the length of the output file (well, beginning with the most frequent value somewhat mitigates this problem)
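
to be clear, here's how I'd sketch that escape scheme in Python (just a draft; the candidate is free to do better):

def encode_offset(offset):
    # 255 in the output means "254 plus whatever the next byte says"
    out = bytearray()
    while offset >= 255:
        out.append(255)
        offset -= 254
    out.append(offset)        # final byte, 0 to 254
    return bytes(out)

def decode_offset(data, pos=0):
    total = 0
    while data[pos] == 255:   # keep eating escapes
        total += 254
        pos += 1
    return total + data[pos], pos + 1

# 850 = 254 + 254 + 254 + 88, so it costs four bytes instead of one
assert encode_offset(850) == bytes([255, 255, 255, 88])
assert decode_offset(encode_offset(850)) == (850, 4)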

ok.

here's the idea:
instead of byte values, you write offsets for each byte value (working from the most frequent byte value down to the least frequent);
then, as you write offsets, you put a flag at each position in the original file where you located that occurrence;
then, on each later pass (since there are 256 values from 0 to 255, you will do the job up to 256 times), every time you hit a flagged position you simply don't count it toward the offset for the current value;
a quick example for a file of 20 bytes, composed of 6 values (39, 44, 11, 18, 74, 78):
01 39
02 44
03 39
04 11
05 18
06 18
07 11
08 78
09 39
10 11
11 44
12 39
13 18
14 11
15 11
16 11
17 74
18 44
19 78
20 39

first, the statistics (value, count):
39 5
44 3
11 6
18 3
74 1
78 2

sorted by count (and printed to the output file):
1 11
2 39
3 44
4 18
5 78
6 74
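
that statistics pass is nearly free in Python, by the way (a sketch; note that most_common() keeps first-seen order on ties, which is how 44 lands before 18 here):

from collections import Counter

sample = [39, 44, 39, 11, 18, 18, 11, 78, 39, 11,
          44, 39, 18, 11, 11, 11, 74, 44, 78, 39]

counts = Counter(sample)
order = [value for value, count in counts.most_common()]
print(order)   # [11, 39, 44, 18, 78, 74]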

now look at the first pass (for the most frequent value, 11):
01 39 +
02 44 +
03 39 +
04 11 + ( offset is 04 - 00 = 4 )
05 18 +
06 18 +
07 11 + ( offset is 07 - 04 = 3 )
08 78 +
09 39 +
10 11 + ( offset is 10 - 07 = 3 )
11 44 +
12 39 +
13 18 +
14 11 + ( offset is 14 - 10 = 4 )
15 11 + ( offset is 15 - 14 = 1 )
16 11 + ( offset is 16 - 15 = 1 )
17 74 +
18 44 +
19 78 +
20 39 +

so the output file looks like:
4
3
3
4
1
1

next value (39):
01 39 +1
02 44 +
03 39 +2
04 11 . ( here's a flag )
05 18 +
06 18 +
07 11 .
08 78 +
09 39 +4 ( 09 - 03 = 6 , -1 for flag at 04 , -1 for flag at 07 )
10 11 .
11 44 +
12 39 +2 ( 12 - 09 = 3 , -1 for flag at 10 )
13 18 +
14 11 .
15 11 .
16 11 .
17 74 +
18 44 +
19 78 +
20 39 +5 ( 20 - 12 = 8 , -3 for flags at 14 , 15 , 16 )

adding to the output file:
1
2
4
2
5

next value (44):
01 39 .
02 44 1
03 39 .
04 11 .
05 18 +
06 18 +
07 11 .
08 78 +
09 39 .
10 11 .
11 44 4
12 39 .
13 18 +
14 11 .
15 11 .
16 11 .
17 74 +
18 44 3
19 78 +
20 39 .

adding to the output file:
1
4
3

next value (18):
01 39 .
02 44 .
03 39 .
04 11 .
05 18 1
06 18 1
07 11 .
08 78 +
09 39 .
10 11 .
11 44 .
12 39 .
13 18 2
14 11 .
15 11 .
16 11 .
17 74 +
18 44 .
19 78 +
20 39 .

adding to the output file:
1
1
2

etc

note that as you advance to the less common values, the offsets become low (and that's exactly what the program is for);
in a huge file you should end up with a lot more low values (like 001, 030, 050, etc.) than big ones (190, 220, etc.);
here's what I call a heterogeneous file :)
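
to tie the walkthrough together, here's a minimal Python sketch of the whole encoder (a prototype, not the final program); it reproduces the 20-byte example above, and the value table plus the offsets are enough to rebuild the original file:

from collections import Counter

def heterogeneise(data):
    counts = Counter(data)
    order = [value for value, count in counts.most_common()]
    flagged = [False] * len(data)
    offsets = []
    for value in order:                  # most frequent value first
        distance = 0
        for i, b in enumerate(data):
            if flagged[i]:
                continue                 # flagged positions don't count
            distance += 1
            if b == value:
                offsets.append(distance)
                flagged[i] = True        # flag it for the later passes
                distance = 0
    return order, offsets

sample = [39, 44, 39, 11, 18, 18, 11, 78, 39, 11,
          44, 39, 18, 11, 11, 11, 74, 44, 78, 39]
order, offsets = heterogeneise(sample)
print(order)    # [11, 39, 44, 18, 78, 74]
print(offsets)  # [4, 3, 3, 4, 1, 1, 1, 2, 4, 2, 5, 1, 4, 3, 1, 1, 2, 1, 2, 1]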

if my vision is correct, you will be able to WinZip the output file, giving birth to a new zip file, which in turn can be made heterogeneous again, for even a 1% gain per round (but repeated 1,000 times, or 1,000,000 times if you want to transfer a 1 GB file onto a 720 KB floppy, lol)
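
here's how I picture that loop, reusing heterogeneise() and encode_offset() from the sketches above, with zlib standing in for WinZip (the container format is made up for the sketch, and "payload.bin" is a placeholder):

import zlib

def serialize(order, offsets):
    # made-up container: value count, the values themselves, then the
    # escape-coded offsets
    out = bytearray([len(order) - 1]) + bytearray(order)
    for offset in offsets:
        out += encode_offset(offset)
    return bytes(out)

blob = zlib.compress(open("payload.bin", "rb").read(), 9)
while True:
    order, offsets = heterogeneise(blob)
    smaller = zlib.compress(serialize(order, offsets), 9)
    if len(smaller) >= len(blob):
        break                # no gain this round, so the loop stops here
    blob = smaller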

so, where am I wrong?
 
