ebook img

Reasoning about Spreadsheets with Labels and Dimensions PDF

26 Pages·2010·1.99 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Reasoning about Spreadsheets with Labels and Dimensions

Reasoning about Spreadsheets with Labels and Dimensions(cid:63) ChrisChambers MartinErwig OregonStateUniversity OregonStateUniversity [email protected] [email protected] July10,2010 Abstract Labelsinspreadsheetscanbeexploitedforfindingformulaerrorsintwoprincipallydifferentways. First,thespatialrelationshipsbetweenlabelsandothercellsexpresssimpleconstraintsonthecellsusage informulas. Second,labelscanbeinterpretedasunitsofmeasurementstoprovidesemanticinformation aboutthedatabeingcombinedinformulas,whichresultsindifferentkindsofconstraints. In this paper we demonstrate how both approaches can be combined into an integrated analysis, whichisabletofindsignificantlymoreerrorsinspreadsheetsthaneachoftheindividualapproaches. In particular,theintegratedsystemisabletodetecterrorsthatcannotbefoundbyeitheroftheindividual approaches alone, which shows that the integrated system provides an added value beyond the mere combinationofitsparts.Wealsocomparetheeffectivenessofthiscombinedapproachwithseveralother conceivablecombinationsoftheinvolvedcomponentsandidentifyasystemthatseemsmosteffectiveto findspreadsheetformulaerrorsbasedonlabelandunit-of-measurementinformation. Key Words: Spreadsheet, Dimension, Unit of Measurement, Static Analysis, Inference Rule, Error Detection 1 Introduction Spreadsheets are widely used [1] end-user programs that contain many errors [2]. To improve the quality of spreadsheets a variety of approaches to prevent, detect, and remove errors from spreadsheets have been investigated. Sincepreventiveapproaches,inprinciple,havetointerferewiththecreationprocessthatmakes spreadsheets so attractive to end users, much research has instead focused on the detection and removal of errors. One type of error that can be detected in spreadsheets is dimension errors, which occur when units of ∗ThisworkispartiallysupportedbytheNationalScienceFoundationunderthegrantITR-0325273andbytheEUSESConsor- tium(http://EUSESconsortium.org). 1 measurementareusedincorrectlyinformulas. Unitsofmeasurementscanbeemployedasaconcretenotion oftypesthatiswellknownamongendusers[3],andareusedtocharacterizedifferentkindsofvalues,much liketraditional,moreabstract,typesystemsusedingeneral-purposeprogramminglanguages. Forexample, a floating point number, which has just one type, can nevertheless represent different kinds of quantities, suchaslengthortimevalues. Several systems [4, 5, 3] have been developed in order to deal with dimension errors. Among these dimensioninference[4]isamethodthatcanbeusedtoautomaticallyfinddimensionerrorsinspreadsheets. Thisapproachhasbeenshowntoworkreliablyandeffectivelyinmanycases,howeveritdoesnottakefull advantageoftheinformationprovidedinthespreadsheetasitdoesnotutilizethestructureofthespreadsheet andfocusesonensuringthatformulasaredimensioncorrect. In contrast, there are several systems that are designed to directly take advantage of the labels and the structureofspreadsheets. Thesepurelylabel-basedapproaches,suchasUCheck[6]orthesystemdescribed in [7], are designed to find formula errors caused by inconsistent label usage. This technique operates in two distinct analysis phases. The first phase defines header or label information for the entire spreadsheet. UCheck is able to infer this while most others require users to annotate the labels for every cell. Once the headersaredeterminedforasheet,labelsareassignedtocellsbasedonheadersandformulas. Inthesecond phasethisinformationisanalyzedtofinderrorsinformulasbyidentifyinginconsistentlabels. OnethingtonotewhenlookingatUCheckanddimensioninferenceisthatbothsystemsrelyonheader andlabelinformation. However,thisinformationisusedquitedifferentlybythetwosystems. Bycombining dimensionanalysiswiththepurelylabel-basedapproachesthestructureofthespreadsheetcouldbeusedto helpstrengthenthereasoningofthesystem. TosomedegreethiswasalreadytriedintheSLATEapproach [5]. However, SLATE only transforms labels and dimensions and does not identify errors. Moreover, the fact that SLATE is a stand-alone spreadsheet system that cannot be integrated into Excel together with the additionaltimerequiredofausertoannotateaspreadsheetrenderstheapproachcurrentlyimpractical. In recent work [8] we have designed an integrated system that combines label-based reasoning with dimensioninference. Thisapproachwasachievedbygatheringbothlabelanddimensioninformationabout acell. Inmanycasesaspreadsheetcontainsdimensionsononlyoneaxis,withthelabelsonthedimension freeaxisgoingunusedindimensioninference. Theseunusedlabelscanhelptoprovidestructuralinforma- tion that can be exploited by the reasoning system behind UCheck, and create a system that can check a spreadsheetforunitofmeasurementerrorsaswellasdetectlabelerrors. When run on the EUSES repository this combined system was able to detect several errors that nei- ther UCheck nor dimension inference were able to discover. This brief evaluation shows the validity of a combinedlabelanddimensioncheckingsystem,butwasthisthebestpossiblecombination? In previous work [9], a system architecture was developed to study how two specific systems could 2 be integrated, in this case, WYSIWYT and UCheck. In this work, errors are introduced into spreadsheets andseveralcombinationsofthesystemsaretestedtodeterminethebestpossiblecombinationofreasoning systems. Theprocess showedthat whensystems arecombined theresults arepositive andmore errorscan bedetected. In this paper we will describe several different options, along with the original integrated label and dimensionanalysis[8], thatcanbepursuedtocheckaspreadsheetforlabelanddimensionerrors. Wewill evaluatethesedifferentsystemsbyrunningthemonthe487dimensionandformulacontainedsheetsinthe EUSEScorpus,asdeterminedin,[4]anddeterminetheefficiencyandcorrectnessforeachcombination. The rest of this paper is structured as follows. In Section 2 we illustrate the issues involved in adding label reasoning to dimension inference with a small example. In Section 3 we formalize spreadsheets and presentmodelsofdimensionsandlabels. TheoriginalcombinedanalysismethodisdescribedinSection4. InSection5wedescribethedifferentmodificationsmadetothesystem. Section6reportsontheevaluation oftheseprototypicalimplementations. WediscussrelatedworkinSection7andgiveconclusionsinSection 8. 2 A Motivating Example Toexplainhowtheintegrationofspatialandsemanticlabelanalysisworks,wewillshowhowbothdimen- sioninferenceandtheintegratedsystemworkonthespreadsheetinFigure1. Thisspreadsheetiscalculating howfarspecificcarscantravelonafulltankofgasbasedontheresultofadriveusingonlyfivegallonsof gas. A B C D E F 1 Car Miles Gallons MPG Total Gallons Miles Possible 2 VW Bug 180 5 =B2/C2 12 =E2*D3 3 Camry 225 5 =B3/C3 10 =E3*D3 4 BMW 300 5 =B4+C4 15 =E4*D4 5 =SUM(B2:B4) =MAX(F2:F4) 6 Figure1: ExampleSpreadsheet When the spreadsheet is checked with dimension inference, it would first identify the headers for all cells. Whentheheadersareanalyzedfordimensioninformation,B1,C1,D1,E1,andF1allmaptoavalid dimension. Thiswouldallowthesystemtocheckallformulasinthisspreadsheetfordimensioncorrectness. Inthiscase,thesystemwoulddetectthatthereisanerrorincellD4wheretheformulaistryingtoaddmiles andgallons. Upon further inspection it could be noted that the spreadsheet contains another error. In this particular 3 example, the cell F2 has the formula E2*D3. The dimension for E2 is Gallons, and the dimension for D3 isMilesperGallon,which,whenmultipliedtogetherresultinthedimensionMiles. Thisresultcontainsno dimensionerrors,butitdoesnotseemright. E3isactuallytotalGallonsfortheCamry,whileD2istheMPG fortheVWBug. Logically,theresultdoesnotmakesense,however,plaindimensioninferencewouldhave nowaytocatchthis. By integrating label reasoning and checking that formulas are both dimension and label correct, the systempresentedinthispaperisabletoidentifyapreviouslyunnoticederror. Thefirststepistodetermine whichheaderaxis(roworcolumn)willbeusedasthedimensionaxis. Inthiscasethereareseveraldimen- sionsonthehorizontalaxis(row1),butdimensionlesslabelsontheverticalaxis(columnA).Therefore,the system then identifies labels and dimensions for each cell. For example, the cells in row 2 would have the label“VWBug”. With this information assigned, the system can then check to ensure that formulas are dimension and labelcorrect. WhenthesystemcheckstheformulainF2itcanidentifythatitismultiplyingacell,E2,with theunitGallonsandthelabel“VWBug”withthecellD2,whichhastheunit“Miles/Gallon”andthelabel “Camry”. Whilethedimensionsworkoutinthisformula,thesystemwillidentifyaninconsistencywiththe labelsandbeabletoreportthistotheuser,asshowninFigure2. ThecellsF4andF5areshadedyellowto indicateproblemscausedbythepropagationoferrors. A B C D E F G 1 Car Miles Gallons MPG Total Gallons Miles Possible 2 VW Bug 180 5 =B2/C2 12 =E2*D3 3 Camry 225 5 =B3/C3 10 =E3*D3 4 BMW 300 5 =B4+C4 15 =E4*D4 5 =SUM(B2:B4) =MAX(F2:F4) 6 7 Error: Error: 8 Miles and Gallons can not The labels in this formula 9 be added do not match 10 11 Figure2: ResultsforExampleSpreadsheet 3 Representations for Spreadsheet Analysis Inthissection, wewillformalizethenotionsofspreadsheets, dimensions, andlabelsinpreparationforthe formalrulesystemthatisdiscussedinSection4.4. 3.1 SpreadsheetSyntax Spreadsheets (S) are functions that is a map addresses (a∈A) to expressions (e), in particular, S(a) yields theexpressionstoredataddressainthespreadsheetS. Expressionscanbevalues(v)orreferencestoother 4 cells(↑a),orareconstructedusingarithmetic(+or∗),aggregating(count),orconditionaloperators. e ::= v|↑a|e+e|e∗e|count(e,...,e)|if(e,e,e) Theoperations+and∗represent,respectively,awholeclassofadditiveoperators(including−andMAX) andmultiplicativeoperators(including/). 3.2 Dimensions A dimension (d) is given by a set of dimension components (c). Each component is given by a base (b), a conversionfactor(f),andanintegerexponent(n). Adimensioncomponentcanalsobeadimensionvariable (δ). Theidentitydimension{}isusedfordimensionlessvalues. d ::= {c,...,c} c ::= bn |δ f Foreachbasedimensionweidentifyadefaultunitwithfactor1. Forexample,thedefaultforlengthismeter (m),thatis,m=length1,whichalsomeansthatcm=length1 andft=length1 . Ingeneral,thefollowing 1 0.01 0.3 relationshipholds(wherexisadimensionlessnumberandbisanarbitrarybase): xbn =xfbn. f 1 In general, the choice of dimensions is arbitrary and depends on the application. For the task of analyzing dimensions in arbitrary spreadsheets, we have chosen the seven base units from the International System of Units (SI) [10], the modern metric system of measurement used throughout the world. To discover any dimensions that were not covered by these seven base units, a script was run over the EUSES corpus [11] whichidentifiedtwonewbasedimensions,MoneyandAngle. Combinedtheseninebasedimensionswere abletocoveralldimensionsfoundintheEUSEScorpus. Examplesofcompositedimensionsarespeed,measuredinm/s,thatis{length,time−1},orforce,mea- suredinkgm/s2,whichis{mass,length,time−2}. Therelationshipbetweenbasicandderiveddimensionsandunitsisillustratedwithseveralexamplesin Table2. Defaultunitsareshownasbold. Whilethisapproachofrepresentingconversionsbyasimplefactorisnotgeneralenoughtocoversome conversions,specificallydegreesFahrenheittodegreesCelsius,itkeepstheunificationofdimensionsfeasi- bleandworksinmostcases. Previouswork[4]hasshownthatthisrestrictionisnottoosevereasonly2out of487spreadsheetscontaineddimensionsthatcouldnotbeconvertedusingthemodelpresentedabove. 5 Table1: Basedimensionswithdefaultunits Quantity DefaultUnit length meter(m) mass kilogram(kg) time second(s) electriccurrent ampere(A) temperature kelvin(K) amountofsubstance mole(mol) luminousintensity candela(cd) money dollar($) angle degree(deg) Table2: Basicandderiveddimensionsandcorrespondingunits Dimension Units basic length m,cm,km,ft,... time s,min,hr,... mass kg,pounds,... ... ... derived speed m, km,... s hr force kgm =Newton,dyne,... s2 pressure kg =Pascal,psi,atm,... ms2 ... ... 3.3 Labels Thelabelingstructureinthisintegratedsystemisasimplifiedversionoftheformalmodelpresentedin[12]. In particular, since labels will be used only for one axis, we can omit the concept of AND labels, which leadstomuchsimplifiedrulesforcombininglabels. Thesyntaxthatweuseforlabelsisshownbelow. (NotethedifferencebetweenanOR-label(cid:96) |(cid:96) and 1 2 theverticalbar|toseparategrammaralternatives.) (cid:96) ::= v|(cid:96) [(cid:96) ]|(cid:96) |(cid:96) |1 1 2 1 2 To show the different possible types of labels we will look at several cells in Figure 1. A chain of labels, (cid:96) [(cid:96) ], represents cells that may have a second level label. In this example, the cell B3 has the header 1 2 A3, which contains the label “Camry”. Since A3 has as its header the cell A1 with label “Car”, the label associatedwithB3is“Car[Camry]”. 6 An OR label is used when cells with labels are added together. In general, when two cells are added together, their labels are ORed to produce a resulting label. The cell B5 is a SUM, which adds the three value cells in column B. The three labels used in this formula are “Car[VW Bug]”, “Car[Camry]”, and “Car[BMW]”. The resulting label is “Car[VWBug]|Car[Camry]|Car[BMW]”. Since OR distributes over label chains [12], we can factor this expression to “Car[VW Bug|Camry|BMW]”. This label expression canbegeneralizedto“Car” [6]. The ADD rule requires a compatibility of labels expressed by the following label simplification rule, (cid:96) (cid:111) (cid:96)→(cid:96),definedasfollows. (cid:96) (cid:111) (cid:96)→(cid:96) (cid:96) (cid:111) (cid:96)[(cid:96) ]→(cid:96) 1 (cid:96)[(cid:96) ] (cid:111) (cid:96)→(cid:96) 1 This operation works only for specific arguments, and if it fails in the premise of an inference rule, then a labelerrorhasbeenidentified. 3.4 OnSemanticvs.SyntacticLabelAnalysis Before we describe our integrated reasoning tool, we want to point out a principal difference of the two underlyingapproaches,becauseeventhoughbothUCheckanddimensioninferenceuselabelstodetermine errors,theyusethisinformationinquitedifferentways. UCheckessentiallyexploitstherelativepositionoflabels,butitdoesnotactuallyinterpretlabels. This meansthatwecanrenamelabelswithoutchangingthefunctionalityofthesystem(atleastiftherenaming isdonesystematically). Forexample,thelabelsCamryorBMWhavenomeaningtothesystemandcould bereplacedbyanyotherstrings. Itishowtheselabelsarecombinedwithotherlabelsinformulasthatforms thebasisoferrordetection. Dimensioninference,ontheotherhand,derivessomemeaningfromlabels. Insteadoftreatinglabelsas simplestrings,labelsindimensioninferenceactuallyareinterpretedtohavesomeextrinsicsemantics. This meansthatrenamingalabelcouldcauseerrorsinaspreadsheet. UsingthesheetinFigure1,ifthelabelin cellC1isrenamedto“Hours”,theformulasincolumnDwillnotbedimensioncorrectasthelabelforthat columnis“Miles/Gallon”,buttheresultingdimensionis“Miles/Hours”. Tokeeptheformuladimension correctthelabelsinD1andE1wouldhavetoberenamedaswell. The simple act of changing these three labels, to the sheet shown in Figure 3, now changes the entire meaningofthespreadsheet. Despitethefactthatnovalueswerechangedthedatanowrepresentssomething entirely different, as the the spreadsheet is now calculating how many miles a car can drive based on the speedmeasuredoverashortersampledistance. Thissimpleexampleshowshowdimensioninferenceuses themeaningsoflabelstogatherinformationandhowalabelchangecanmakealargeimpactonthemeaning 7 ofaspreadsheet. A B C D E F 1 Car Miles Hours MPH Total Hours Miles Possible 2 VW Bug 180 5 =B2/C2 12 =E2*D3 3 Camry 225 5 =B3/C3 10 =E3*D3 4 BMW 300 5 =B4+C4 15 =E4*D4 5 =SUM(B2:B4) =MAX(F2:F4) 6 Figure3: AnExampleDemonstratingtheMeaningofLabels 4 Integrated Label and Dimension Analysis Integrated label and dimension analysis (or LD analysis for short), while strongly based on both UCheck anddimensioninference,hasbeendesignedasanadd-ontodimensioninference. Asanadd-on,LDanalysis goesthroughmuchthesameprocessasdimensioninference,withafewchanges. Thefivedistinctstepsthat the LD analysis performs while checking a spreadsheet are shown below and discussed in greater detail in thissection. 1. Headerinference 2. Labelinterpretation 3. Identifydimensionandlabelaxes 4. Combinedlabelanddimensioninference 5. Dimensioninstantiation Header inference, label interpretation, and dimension instantiation are components that we have adopted unchanged from our previous work [4], and they are therefore only briefly described here. Steps 3 and 4 willbedescribedingreaterdetailinthefollowing. Thelaststepappliesonlyinthosecaseswhenthestep4 produces underspecified dimensions, that is, when it results in inferred dimensions that contain dimension variables. 4.1 HeaderInference Headerinferenceanalyzesthestructureofaspreadsheetandreturnsasetofheadersforeachcell. Aheader issimplytheaddressofanothercell. Therefore,headerinferenceproducesabinaryrelationH⊆A×Asuch that (a,a(cid:48))∈H says that a(cid:48) is a header of a. In general, one cell can be a header for many cells, and any particularcellcanhavezero,one,ormoreheaders. 8 Figure4showstheidentifiedheaders,outlinedinblue,fortheexampleinFigure1. Lookingatspecific cells, B1 is a header for B2, B3, and B4, that is, H−1(B1)={B2,B3,B4}, and A2 and B1 are headers of B2,thatis,H(B2)={A2,B1}. ThesespecificrelationshipsareshownwiththearrowsinFigure4. A B C D E F 1 Car Miles Gallons MPG Total Gallons Miles Possible 2 VW Bug 180 5 =B2/C2 12 =E2*D3 3 Camry 225 5 =B3/C3 10 =E3*D3 4 BMW 300 5 =B4+C4 15 =E4*D4 5 =SUM(B2:B4) =MAX(F2:F4) 6 Figure4: ExampleHeaders Header inference essentially works by analyzing the spatial relationships between different kinds of formulas. It can also take into account layout information. Techniques for header inference have been describedindetailelsewhere[13,6]. Inthecontextofthispaperwesimplyreusethosetechniques. 4.2 LabelInterpretation In the second phase of LD analysis we try to derive a dimension for each label contained in a cell that hasbeenidentifiedasaheaderbyheaderinference. Thisprocessworksby(a)splittinglabelsintoseparate words,(b)removingwordinflections,(c)mappingwordstemstodimensions,and(d)combiningdimensions intoonedimension. Forexample,cellE1inFigure1isaheadercellandisthereforesubjecttolabelinterpretation. Itsvalue can be split into the two words “Total” and “Gallons”, and the plural of “Gallons” can be removed. The resulting “Gallon” can then be mapped to the dimension “Gallon”. In contrast, “Total” cannot be mapped into any dimension and will thus be mapped to {}. Finally, the combination of both dimensions yields “Gallons”. Ifnopartofaheaderlabelcanbemappedtoadimensionotherthan{},thelabelismappedtoa dimensionvariableδ,whichindicatesthatthedimensionisatthistimeunknown. 4.3 IdentifyDimensionandLabelAxes Thegoalofoursystemistoexploitoneaxisfordimensioncheckingandtheotherforlabelchecking. Inthe formalrulesystem,thisseparationofanalysisisreflectedbytwonewjudgments,S,L(cid:96)a:(cid:96)andS,D(cid:96)a:d. These judgments specify how the integrated system gets labels and dimensions for each cell. The header relationshipH identifiedintheheaderinferencephasehastobepartitionedintotwopartsH=L∪Dwhere: • Listhesetofheadersthatdefinelabels 9 • Disthesetofheadersthatdefinedimensions Ifacellhastwoheaders, onedefiningdimensionsandtheotherdefiningaplainlabel, bothofthesepieces ofinformationareexploitedtomaketheinferencestronger,ascanbeseenintheCOMBOHDRruleinFigure 6. Tofacilitatethepartitioningoftheheaderrelationship,wehavetoidentifytableregionsinaspreadsheet and for each table its horizontal and vertical label axes. The information provided by label interpretation allowsustotheidentifythefollowingthreecasesforaxes: 1. NoDimensionAxis 2. OneDimensionAxis 3. TwoDimensionAxes In the following we will describe each of the possibilities in some detail. To assist in this explanation Figure5,whichshowsasimplespreadsheetcontainingthreetables,willbeused. Eachofthethreetablesis highlightedandillustratesanexampleofoneofthethreepotentialconfigurationsoftheaxes. Figure5: AxisExamples No Dimension Axis In the simplest case, that is a spreadsheet that has no units of measurement, there will be no dimension axis. If this is the case, then there is no need to run dimension inference at all. The previoussystemwouldnothavebeenabletodetectanyerrors. However,withtheintegrationoflabel-based reasoningwecanrunUChecktodetectlabelerrors. Whilethisisnotthegoalofthesystem,itdoesgivethe toolafunctionforspreadsheetswithoutdimensions. AnexampleofthisisshowninthefirsttableinFigure5. Thistablecontainsnounitsofmeasurements inanyoftheaxes,andifdimensionanalysiswererun,itwouldgarnernoresults. OneDimensionAxis Thesituationwherethissystemismostusefuliswhenthereisoneaxisthatcontains dimensions in a spreadsheet. If this is the case, the system will be able to combine label-based reasoning withdimensioninference. Thesystemwillusetheidentifieddimensionandlabelaxestoassignlabelsand dimensionsforeachcellinaspreadsheet. Thisinformationcanthenbeusedtodetecterrors. 10

Description:
Jul 10, 2010 Second, labels can be interpreted as units of measurements to provide Key Words: Spreadsheet, Dimension, Unit of Measurement, Static Analysis, [3] Tudor Antoniu, Paul A. Steckler, Shriram Krishnamurthi, Erich
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.