
EXPERIMENTAL STUDY OF SHANNON-FANO, HUFFMAN, LEMPEL-ZIV-WELCH AND OTHER LOSSLESS ALGORITHMS

9 Pages·1991·0.43 MB·English


Proc. ARRL 10th Computer Networking Conference, 1991

D. Dueck and W. Kinsner, VE4WK
Department of Electrical and Computer Engineering
University of Manitoba
Winnipeg, Manitoba, Canada R3T 2N2
Fax: (204) 2750261
e-Mail: Kinsner@cm.UManitoba.CA
E-mail: VE4WK@[...].MB.CAN.NA

Abstract

Knowledge of the statistics of a source bit stream is required when selecting the most efficient statistical data compression techniques and designing the best codes. This paper presents a program called Statistical Analysis of Files (STAF) that analyzes and supplies the statistics on such bit streams. The two key statistics are the entropy (a measure of information content) and the frequency of occurrence table. The entropy measure is used in establishing the compression limit for statistical techniques, while the frequency of occurrence is vital in the designing of optimal variable-length codes such as Huffman and Shannon-Fano. Other statistics analyzed for optimal code compression techniques are run-length encoding, half-byte packing, and diatomic character encoding. In the entropy report, the statistical techniques are compared with a popular adaptive nonstatistical dictionary encoding algorithm, the LZW technique, to give a comparison with other methods of lossless compression.

1. INTRODUCTION

Transmission of information requires symbols (data). If the data representation of information is compact (i.e., no redundancy is present) the information can be transferred faster than with redundant data, given the same data transfer rate, expressed in bits per second (bps). Thus, every bit of information is transferable. [...] newspaper can be very expensive to send, but the price is worth it to a stockbroker.
Similarly, last night's sports scores are very important to the sports fan, but for someone who dislikes sports, the information is useless.

Information can be sent by a variety of methods, and each method has its own benefits and pitfalls. For a given probability of error, a shorter transmission would be less susceptible to error than its longer corresponding uncrunched source. Nevertheless, the shorter bit stream is more sensitive to errors, such that an error of one bit could destroy the whole information, whereas an error in the uncompressed bit stream could likely be corrected. Which, then, is the best method? Clearly, the best method is the fastest, most perfect method, though which one is best for a particular type of information is the $64M question.

Data compression can be employed to reduce the amount of data to be transferred, without losing any of the information itself. Most data is actually very repetitive (redundant), and can be "crunched" so as to remove the repetitiveness. A good elementary example of data compression is secretarial short-hand. People who know the code can transfer information quickly from a spoken format into a written format. Then, at a more convenient time, the code can be expanded into a normal textual format. Elimination of repetitiveness and redundancy thus improves information transmission.

Information exchange fills each of our days. If it can be improved in a manner that does not hinder our understanding, it should be done. Information can be represented in a variety of ways, such as pictures, speech, books, television, and computer files. All of these formats (and many others) require a transfer from a source to a destination.
Thus, the transfer is what can be improved, and just about everyone can appropriate it.

A program designed to analyze any bit stream and provide the necessary statistical information for code design has recently been described [DuKi91]. The Statistical Analysis of Files (STAF) program provides the user with a portrait of a file's statistics. On the basis of the report, the user can then decide which type of data compression would be the most suitable for their particular piece of data. The STAF program generates two types of reports: a short and a full-length report. The short report is a one or two page report containing the following two segments: (i) the entropy analysis, and (ii) the sorted frequency of occurrence. The full report gives the following results: (i) entropy analysis, (ii) a full character frequency report, (iii) the half-byte, (iv) run-length encoding, and (v) diatomic character analyses.

3. DESCRIPTION OF PROGRAM MODULES

The STAF computer program has five modules. The two types of reports (short and full) are similar in their scope: the short report gives the entropy analysis and the sorted character frequency chart, while the long report gives additional half-byte analysis, repeated character string analysis, diatomic analysis, and the critical ASCII frequency of occurrence chart.

3.1 Entropy Report Module

The entropy analysis module is divided into an uncompressed and a compressed section. The uncompressed analysis is the most basic part of the program. Firstly, it gives the count of the total number of characters in the file, which is also converted into bits. Next, the number of distinct characters in the file is printed.
Finally, the source entropy, H_u, is calculated [Kins91c]. The compressed analysis gives six important statistics useful for determining whether a statistical compression method is feasible. The statistics are described next.

Theoretical Statistical Compression

Let us consider a source containing 200 characters: 100 Es, and 50 each of T and A. The entropy is calculated as

$$H_u = -\sum_{i=1}^{m} p_i \log_2 p_i \qquad (3.1)$$

$$H_u = -\left(0.50 \log_2 0.50 + 2 \times 0.25 \log_2 0.25\right) = 1.5 \ \text{bits/character}$$

where p_i is the probability of occurrence of each symbol in the source, and m is the number of symbols in the source text or bit stream (hereafter referred to as source).

The ultimate theoretical statistical compression (TSC) is the barrier which every statistical code seeks to equal, but which is rarely, if ever, achieved. The ultimate TSC percentage, U_p, is calculated from

$$U_p = \frac{a_s - H_u}{a_s} \times 100\% = \frac{8 - 1.5}{8} \times 100\% = 81.25\% \qquad (3.2)$$

where a_s is the number of bits used to represent a character (normally eight, according to ASCII convention), and H_u is the source code entropy. Notice that for this example we have a perfect code, one that is statistically as concise as possible. But in normal practice, a three symbol code is next to useless, and a larger and undoubtedly less perfect code would be created with a larger symbol set.

Theoretical Variable Length Codewords

While the theoretical statistical compression, U_p, is the standard to measure up to, the theoretical variable-length code entropy is the estimate of the entropy when variable length (VL) coding is used.
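The figures of the 200-character example above can be checked with a short Python sketch. It is not taken from STAF; the function names `source_entropy` and `ultimate_compression_percent` are hypothetical, and the sketch assumes the symbol counts are already available as a mapping.

```python
import math
from collections import Counter

def source_entropy(counts):
    """Entropy in bits/character from a symbol -> count mapping (Eq. 3.1)."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in counts.values() if n > 0)

def ultimate_compression_percent(h_u, bits_per_char=8):
    """Ultimate theoretical statistical compression in percent (Eq. 3.2)."""
    return (bits_per_char - h_u) / bits_per_char * 100.0

# The worked example: 100 Es and 50 each of T and A.
counts = Counter({"E": 100, "T": 50, "A": 50})
h_u = source_entropy(counts)
u_p = ultimate_compression_percent(h_u)
print(f"H_u = {h_u} bits/character, U_p = {u_p} %")
```

For a real file, `Counter(open(path, "rb").read())` would supply the counts of the byte values in the same form.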
While it would be necessary to actually construct the Shannon-Fano or Huffman codes to know the exact number of bits needed to code each symbol, the estimated number of bits, l_i, required to encode a symbol with probability p_i is

$$l_i = \left\lceil \log_2 \frac{1}{p_i} \right\rceil \qquad (3.3)$$

The theoretical variable-length code entropy, H_V, can be calculated from

$$H_V = \sum_{i=1}^{m} p_i \, l_i \qquad (3.4)$$

In practice, the actual S-F or Huffman code entropy would approach, but never be smaller than, H_u, such that

$$H_u \le H_{\text{S-F or Huff}} \le H_V \qquad (3.5)$$

is always true.

The Shannon-Fano Compression Technique

The Shannon-Fano (S-F) coding module calculates a possible S-F code and the code entropy. A separate program was developed to calculate a number of S-F codes using a number of different heuristics, but one heuristic consistently created the best code every time, so the STAF program uses only this heuristic. Once the entropy is calculated, it is a simple matter to calculate the length of a file coded in this manner, and the percentage compression that may be gained with this method.

The heuristic used to calculate the S-F tree is this: starting from the top down, the whole array containing the sorted character frequency, from 0 to the number of distinct characters (symbols), is scanned and split in half as close as possible to the middle of the frequency. The program gives each character in the top half (with fewer characters, but appearing more often) a prefix of "zero", and each character in the bottom half a prefix of "one". Now the array is split into two approximately equivalent parts (according to frequency), and the same heuristic used for splitting the initial array in half is called again, recursively, to split the top and bottom halves of the array and add the next character in the S-F code, until all the codes have been generated.

In the routine that actually finds the middle of the array, a check is added that has been coined "Heat-Seeking Capability".
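The top-down splitting just described can be sketched in Python as follows. This is an illustration under stated assumptions, not the STAF routine: the names `closest_split`, `shannon_fano`, `code_entropy` and `estimated_vl_entropy` are hypothetical, the split simply picks the boundary whose running total lies nearest to half of the total count (checking boundaries on both sides of the midpoint), and ties go to the earlier boundary.

```python
import math

def closest_split(freqs):
    """Index k (1 <= k < len(freqs)) whose prefix sum is nearest to half the total."""
    total = sum(freqs)
    best_k, best_gap, running = 1, None, 0
    for k in range(1, len(freqs)):
        running += freqs[k - 1]
        gap = abs(2 * running - total)       # distance from the exact middle
        if best_gap is None or gap < best_gap:
            best_k, best_gap = k, gap
    return best_k

def shannon_fano(chart, prefix="", codes=None):
    """chart: list of (symbol, count) pairs sorted by descending count."""
    if codes is None:
        codes = {}
    if len(chart) == 1:
        codes[chart[0][0]] = prefix or "0"   # a lone symbol still needs one bit
        return codes
    k = closest_split([n for _, n in chart])
    shannon_fano(chart[:k], prefix + "0", codes)   # top half gets a "zero"
    shannon_fano(chart[k:], prefix + "1", codes)   # bottom half gets a "one"
    return codes

def code_entropy(chart, codes):
    """Average code length in bits/character."""
    total = sum(n for _, n in chart)
    return sum(n / total * len(codes[s]) for s, n in chart)

def estimated_vl_entropy(chart):
    """Variable-length estimate: sum of p_i * ceil(log2(1 / p_i))."""
    total = sum(n for _, n in chart)
    return sum(n / total * math.ceil(math.log2(total / n)) for _, n in chart)

chart = [("E", 100), ("T", 50), ("A", 50)]
codes = shannon_fano(chart)
print(codes, code_entropy(chart, codes), estimated_vl_entropy(chart))
```

For the three-symbol source used earlier, the S-F code entropy and the variable-length estimate coincide with the source entropy, which is the "perfect code" case noted in the text.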
Just as a heat-seeking missile or torpedo searches for a source of heat as its target, so this "Find the middle of the array" routine seeks the closest point to the middle. The less advanced "Find the Middle" heuristics check the array, returning the point exactly in the middle or earlier, while the heat-seeking capability checks on both sides of the exact middle point of the array, and returns the closest point, whether it be on the top or bottom side of the array. The old method can be compared to the way contestants guess the prices of items on the popular T.V. game show "The Price Is Right", where the winning answer was "The contestant closest to the actual retail price, but not over, is...". This method seems a little unfair because the winner was not necessarily the one who was the closest, but the closest one who guessed underneath the actual price. The "Heat-Seeking" capability eliminates the disparity, and returns the split closest to the middle.

The Shannon-Fano codes are pleasing to display and analyze, since it is very apparent that the code is self-separating. Huffman codes (explained next) are more complicated in creation, and this results in a code that is not as obviously self-separating.

The Huffman Compression Technique

The Huffman codes are created using a single heuristic. Just as in the S-F module, a separate program was developed which coded a number of Huffman codes and compared them. For each file tested, the same heuristic gave the best result. Each code had the same entropy, but one heuristic produced codes where the maximum code length was less than the others. This is the heuristic that has been implemented in the program. The Huffman code entropy is calculated, and this appears in the entropy report module.

Briefly, Huffman codes are created by fusion [Huff52], [Kins91a]. Shannon-Fano codes are created by splitting an array into successive halves, quarters, etc. (called "top-down splitting"), while Huffman codes are created by repeatedly (recursively) merging the two symbols with the smallest frequency, creating a new entry with the combined frequency (called "bottom-up binary fusing") and adding this new node to the list/array from which the two symbols were taken. The two individual symbols are also removed from the list, since they are now represented by a single combined (and thus higher) frequency. This process continues, always subtracting two nodes from the list and merging their composite probability back to the list, until one large binary tree is formed. The Huffman code can then be read from the tree, starting from its corresponding leaf and going up through the branches to the root, where each step up "left" is assigned a "one", and each step up "right" is assigned a "zero". A more detailed description, with examples of both Huffman and S-F codes, is given in [Kins91a].

The LZW Compression Technique

This algorithm is a non-statistical, dictionary algorithm, and thus it is possible (and likely) that coding in this case could exceed the limit set by the symbol-based entropy, H_u. Currently the only available statistics are those compiled by actually running the source bit stream through this algorithm. The technique creates a dictionary, and this heuristic clears the dictionary when it fills up. [...] A detailed description of the algorithm can be found in [KiGr91], [Kins91b], [Welc84], [Stor88].

3.2 Frequency Report Module

This second module in the STAF program provides the user with two different types of charts: (i) standard, and (ii) sorted charts. The full length report calls for both charts, while the short report calls for only the sorted chart.

Sorted Frequency Chart

The sorted frequency chart looks at the whole file to be analyzed, and sorts the symbols appearing in the source file in a descending order.
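A sorted chart of this kind is also the natural input for the bottom-up fusing described for the Huffman module. The Python sketch below builds such a chart and then repeatedly merges the two least frequent entries. It is only an illustration: the names `sorted_frequency_chart` and `huffman_code_lengths` are hypothetical, it tracks code lengths rather than the actual bit patterns, and its tie-breaking (earlier entries first) is an assumption that need not match the heuristic STAF uses to keep the maximum code length small.

```python
import heapq
from collections import Counter
from itertools import count

def sorted_frequency_chart(data):
    """List of (symbol, count, fraction) in descending order of count."""
    counts = Counter(data)
    total = len(data)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(sym, n, n / total) for sym, n in ordered]

def huffman_code_lengths(chart):
    """Code length per symbol, found by fusing the two smallest entries."""
    order = count()                         # tie-breaker so heap tuples compare
    heap = [(n, next(order), [sym]) for sym, n, _ in chart]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym, _, _ in chart}
    if len(heap) == 1:                      # a lone symbol still needs one bit
        lengths[heap[0][2][0]] = 1
    while len(heap) > 1:
        n1, _, group1 = heapq.heappop(heap)   # smallest frequency
        n2, _, group2 = heapq.heappop(heap)   # second smallest
        for sym in group1 + group2:           # each fusion adds one bit
            lengths[sym] += 1
        heapq.heappush(heap, (n1 + n2, next(order), group1 + group2))
    return lengths

data = "E" * 100 + "T" * 50 + "A" * 50
chart = sorted_frequency_chart(data)
lengths = huffman_code_lengths(chart)
average = sum(p * lengths[sym] for sym, _, p in chart)
print(chart, lengths, average)
```

Summing each fraction times its code length gives the Huffman code entropy in bits/character, which is the quantity reported alongside the S-F figure.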
Characters in the normal ASCII character set which do not appear in the file are not shown, for clarity. The chart shows each character, the integer count of the number of times it appears in the source, and the frequency or percentage, p_i, of the file that is that specific character. A typical text file's sorted chart would probably begin with <Space>, followed probably by <CR> (carriage return) and then perhaps the frequent letters E, T, A, and so on. Recall that the Shannon-Fano and Huffman codes are constructed on the basis of this chart.

Standard Frequency Chart

The Standard Frequency Chart prints the entire 256-character ASCII character set, with each character's corresponding count and frequency. This chart would be useful to show how the different characters are used, and which, if any, blocks of characters are either used extensively or not at all. A typical Analog-to-Digital converter supplying 8-bit data could create data where this chart, or a graphical representation of it, could be useful in conjunction with another type of data compression using patterns derived from inspection of the chart. Coding could take advantage of "clusters" of certain symbols being used more often, and codes could be calculated accordingly.

3.3 Half-Byte Encoding Module

Certain types of data contain extensive numerical figures. Business charts, spreadsheets, scientific measurements (or even the sorted frequency charts the STAF program generates) contain many numerical figures which could be compressed. Normally, a character byte takes eight bits (a_s), but if the compression algorithm anticipates a numerical string, a four-bit code could be utilized to encode the digits zero to nine, getting a compression ratio of nearly 50%. This four-bit code means 16 characters can be encoded this way, so that after the ten digits are assigned, there are six extra unused codes, which could be classified as "numerical" and encoded as such.
For example, a phone number (1-800-555-1212) contains 14 characters, including the three hyphens. This could be encoded into nine characters (a starting control byte, the length of the string, and the 14 nibbles), i.e., 64% of the original size, a saving of about 36%. For longer numerical strings, a saving approaching 50% could be achieved. Typically a compression program would assign the six extra characters to be ones that normally show up in association with numbers ("$*/-,."). Thus, the following strings of symbols would be encodable: 800-668-1234; 1984,1988,1990-91; 08/07/91; $***1,234.56; and 3-2212-01176-3284. For highly statistical and numerical files, a compression of up to 50% could be obtained with half-byte encoding.

3.4 Run-Length Encoding Module

Run-Length Encoding (RLE) is a very simple technique, useful for highly repetitive character strings. The first type of RLE is when the encoder anticipates a number of blanks. These can be replaced with two characters: a special character, and the count, n, of the number of blanks. Thus, in a chart or graph that has 10 blanks in a row, they could be encoded as a special control signal character, and then the number ten, to signify "10 blanks in a row." An extrapolation of this technique is a three character code used to represent repeated characters. For example, if the encoder encounters six Rs, they could be encoded as [1 special character] [1 original character] [repeated 6 times]. Normally it is quite rare to have more than two or three characters repeated in a row, so this technique would not be very fruitful. Multiple spaces, however, are a little more common, and are prime candidates for compression. This module only counts repeated strings of three or more characters, because compression savings only start with repetitions of more than three characters.

3.5 Diatomic Analysis Module

The diatomic compression is another very specific technique, which takes certain strings of characters and replaces them with a shorter code.
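Returning to the half-byte scheme of Section 3.3, the phone-number example can be checked with a small Python sketch. Everything named here is an assumption made for illustration: the order of the 16-symbol alphabet, the control byte value `0x1B`, the one-byte length field, and the function names `pack_half_bytes` and `unpack_half_bytes` are not taken from the STAF program.

```python
# Ten digits plus the six extra symbols associated with numbers (order assumed).
NIBBLE_ALPHABET = "0123456789$*/-,."
CONTROL_BYTE = 0x1B   # hypothetical marker announcing a packed numeric string

def pack_half_bytes(text):
    """Pack a numeric string as: control byte, length byte, then two symbols per byte."""
    if len(text) > 255:
        raise ValueError("length must fit in one byte")
    codes = [NIBBLE_ALPHABET.index(ch) for ch in text]  # ValueError if not encodable
    if len(codes) % 2:
        codes.append(0)                     # pad the last nibble
    body = bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))
    return bytes([CONTROL_BYTE, len(text)]) + body

def unpack_half_bytes(packed):
    """Inverse of pack_half_bytes."""
    length = packed[1]
    nibbles = []
    for byte in packed[2:]:
        nibbles.extend((byte >> 4, byte & 0x0F))
    return "".join(NIBBLE_ALPHABET[c] for c in nibbles[:length])

phone = "1-800-555-1212"
packed = pack_half_bytes(phone)
print(len(phone), "characters ->", len(packed), "bytes")
```

The two bytes of overhead matter less as the string grows, which is why the saving only approaches 50% for long numerical strings.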
In this case, the program looks for all pairs of characters, and replaces the most common pairs with a single character code. Replacing every pair would result in a compression percentage of 50%, though this probably would not be possible because of the limited number of available characters to signify pairs of characters. Thus only the most common pairs would be encoded. The most common pairs in the English language are E-, -T, TH, -A, and S- (where - denotes a space) [Held87].

4. EXPERIMENTAL RESULTS

Experiments were performed on a variety of files, most for the IBM PC. The files were chosen for benchmarking the results. The STAF program was run on a 33 MHz IBM compatible computer, and the output sent to an ASCII file on disk. Table 4.1 summarizes the output the STAF program created on each of the files.

Table 4.1 STAF analysis of benchmark files. For each file the three rows give the code entropy (bits/char), the coded length (bytes), and the compression (%). Entries marked [...] are illegible in the source.

FILE              Theoretical   Theoretical    Shannon-Fano   Huffman     LZW
                  entropy       var.-length    coding         coding      coding
                                codes

README.DOC        4.825         5.362          4.92168        4.85651     3.904
(16203 bytes)     9773          10861          [...]          9836        [...]
(82 symbols)      39.7%         32.97%         38.479%        39.294%     51.194%

AMIHELP2.HLP      4.898         5.337          4.94101        4.93135     2.870
(279655 bytes)    171232        186561         172722         172384      100329
(251 symbols)     38.77%        33.289%        38.237%        38.358%     64.124%

WORKS.EXE         7.409         7.956          7.45306        7.43783     7.822
(381776 bytes)    353556        379676         355674         354947      373275
(256 symbols)     7.392%        0.55%          6.837%         7.027%      2.227%

MILLEBV4.EXE      7.142         7.697          7.23544        7.16612     5.934
(21[...] bytes)   193254        208250         195768         193892      160549
(256 symbols)     10.719%       3.791%         9.557%         10.424%     25.828%

TIGER.BMP         1.065         1.292          1.23237        1.23224     0.153
(212086 bytes)    28246         34246          32671          32667       4069
(55 symbols)      86.682%       83.853%        84.595%        84.597%     98.081%

FIT.PCM           6.469         7.004          6.587          6.4998      6.279
(20556 bytes)     16622         17997          16924          16701       16134
(202 symbols)     19.14%        12.45%         17.67%         18.75%      21.51%

NUMBERS.TXT       3.733         4.375          3.78104        3.75881     2.534
(16738 bytes)     7810          9155           7910           7864        5302
(226 symbols)     53.342%       45.307%        52.737%        53.015%     68.324%

SHORT.TXT         4.137         4.582          4.1868         4.1656      5.038
(894 bytes)       463           512            467            465         563
(34 symbols)      48.294%       42.729%        47.679%        47.945%     37.025%

4.1 Description of Benchmark Files

README.DOC - This text file contains mostly upper and lowercase letters, with a sprinkling of a few numbers. It is Borland's Turbo C++ help file containing the last minute changes to their product. The file contains 82 distinct characters, and would represent a typical word processor letter or document.

AMIHELP2.HLP - The Windows 3.0 HELP reference file for the Desktop Publishing program AMI, available to the PC user when <F1> is pressed. The file consists of mostly ASCII text, but contains a generous sprinkling of control and non-printable characters necessary for Windows to be able to interpret it.

WORKS.EXE - One of two executable files tested. This is the .EXE code from the popular Word-Processing/DataBase/Spreadsheet/Communications software package from Microsoft. This fits in the category of non-windows applications.

MILLEBV4.EXE - Another executable file, this one is an implementation of the popular card game Mille Bornes. A good mouse-based, windowed (not MS Windows) program with excellent graphics.

TIGER.BMP - A very simple Windows 3.0 Paintbrush drawing of a cat's face. Although this may not be as detailed as a typical drawing, it illustrates the waste that can be eliminated from every drawing.

FIT.PCM - A digital voice sample of a person speaking the word "FIT." The sample is stored in pulse code modulation (PCM) form. This sample bit stream would appear random, unless displayed in PCM graphical format.

NUMBERS.TXT - The STAF program creates reports which contain many numbers.
The file NUMBERS.TXT is actually the STAF report for the README.DOC file. The sample of this file can be created by running STAF on README.DOC, and then running STAF on the output file after renaming it to a different name.

SHORT.TXT - A shorter memo, created to illustrate Huffman and S-F codes. It contains roughly 30 distinct symbols, and is used to illustrate the two types of coding simply.

4.2 Observations

The specific implementation of the Huffman code is better than the S-F code in every instance of the benchmarks that were tested. From the selected files, it appears that the more symbols in the source bit stream, the smaller the difference becomes. The two EXE files (WORKS, MILLEBV4), each using all 256 symbols, exhibited differences in code entropy of 0.015 and 0.07 bits/character. The two text files, on the other hand, README.DOC (82 symbols) and NUMBERS.TXT (226 symbols), exhibit differences in code entropy of 0.065 and 0.02. The difference tends to decrease as the number of symbols in the bit stream increases. The AMIHELP2.HLP file also supports this observation: it too has more than 250 symbols in the file, and the Huffman and S-F difference in entropy is approximately 0.01.

The TIGER.BMP picture is quite simple, and comes in with an entropy of a resounding 1.065 bits/character. The Huffman and S-F code entropy is virtually identical, though the S-F entropy is still larger.

The file SHORT.TXT is a small paragraph typed in from a book, using lowercase letters only. Containing roughly 30 distinct characters, it yields Huffman and S-F codes that can easily be reconstructed. The LZW algorithm (37%) does not appear to perform as efficiently as the variable length coding routines (48%). This is due to the brevity of the file.
If the file becomes any larger, the LZW will perform better, since it will be able to find redundancy and redundant patterns in the text.

The LZW compression algorithm is non-statistical, and thus its statistics can only be used for comparison. The LZW algorithm is better for each file by approximately 15-20%, except for the WORKS.EXE program, where the LZW routine makes a remarkably poor showing. Typically, LZW gives compressions of 50-60% for normal text files, and only 20-30% for more evenly (or randomly) distributed probabilities like the EXE files and digitized speech. The LZW routine compresses an amazing 98% on the TIGER.BMP picture, though the picture is a relatively simple [...]-color diagram. Nevertheless, all BMP stored format pictures can be compressed, some more than others.

5. CONCLUSIONS

This paper presents a statistical analysis of files computer program developed to design optimal statistical codes for data compression. A set of benchmark files has been selected to test the relative merits of the Shannon-Fano (S-F), Huffman, and Lempel-Ziv-Welch (LZW) optimal codes. Entropy calculations and frequency of occurrence help in assessing the best statistical techniques that can be applied to the given data stream. The frequency of occurrence is used to design optimal Huffman and S-F codes for that stream. The optimal codes are designed on a set of heuristics also evaluated in the study. The specific implementation of the Huffman code appears to be better than the optimal S-F code for all the files tested. Our implementation of the LZW algorithm is also better than Huffman by 15-20%. The LZW gives compressions of 50-60% for normal text files, and 20-40% for executable files.
This statistical analysis of files program could be used as a tool in designing and implementing compression algorithms in future packet radio BBSs and other data transfer systems.

ACKNOWLEDGEMENTS

This work was supported in part by the University of Manitoba and the Natural Sciences and Engineering Research Council (NSERC) of Canada.

REFERENCES

[DuKi91] D. Dueck and W. Kinsner, "A program for statistical analysis of files," Technical Report, DEL91-8, Aug. 1991.

[Held87] G. Held, Data Compression: Techniques and Applications, Hardware and Software Considerations. New York (NY): Wiley, 1987 (2nd ed.), 206 pp. (QA76.9.D33H44 1987)

[Huff52] D.A. Huffman, "A method for the construction of minimum-redundancy codes," Proc. IRE, vol. 40, pp. 1098-1101, Sept. 1952.

[KiGr91] W. Kinsner and R.H. Greenfield, "The Lempel-Ziv-Welch (LZW) data compression algorithm for packet radio," Proc. IEEE Conf. Computer, Power, and Communications Systems, (Regina, SK; May 29-30, 1991), pp. 225-229, 1991.

[Kins91a] W. Kinsner, "Review of data compression methods, including Shannon-Fano, Huffman, arithmetic, Storer, Lempel-Ziv-Welch, fractal, neural network, and wavelet algorithms," Technical Report, DEL91-1, Jan. 1991, 157 pp.

[Kins91b] W. Kinsner, "Lossless and lossy data compression including fractals and neural networks," Proc. Int. Conf. Computer, Electronics, Communication, Control, (Calgary, AB; Apr. 8-10, 1991), pp. 130-137, 1991.

[Kins91c] W. Kinsner, "Lossless data compression for packet radio," Proc. 10th Computer Networking Conf., (San Jose, CA; Sept. 29-30, 1991), these Proceedings, 1991.

[Stor88] J.A. Storer, Data Compression: Method and Theory. New York (NY): Computer Science Press/W.H. Freeman, 1988, 413 pp. (QA76.9.D33S76 1988)

[Welc84] T.A. Welch, "A technique for high-performance data compression," IEEE Computer, vol. 17, pp. 8-19, June 1984.
