Abstraction of Geodatabases

Monika Sester
Institute of Cartography and Geoinformatics, Leibniz University of Hannover, Hannover, Germany

© Springer International Publishing Switzerland 2016
S. Shekhar et al. (eds.), Encyclopedia of GIS, DOI 10.1007/978-3-319-23519-6_13-2

Synonyms

Cartographic generalization; Conceptual generalization of databases; Geographic data reduction; Model generalization; Multiple resolution database

Definition

Model generalization is used to derive a simpler and easier-to-handle digital representation of geometric features (Grünreich 1995). It is applied mainly by National Mapping Agencies to derive different levels of representation, with less detail, from their topographic data sets, usually called Digital Landscape Models (DLMs). Model generalization is also called geodatabase abstraction, as it relates to generating a simpler digital representation of geometric objects in a database, leading to a considerable data reduction. The simplification refers to both the thematic diversity and the geometric complexity of the objects. Among the well-known map generalization operations, the following subset is used for model generalization: selection, (re-)classification, aggregation, and area collapse. Sometimes the reduction in the number of points used to represent a geometric feature is also applied in the model generalization process, although this is mostly considered a problem of cartographic generalization. This reduction is achieved by line generalization operations.

Historical Background

Generalization is a process that has been applied by human cartographers to generate small-scale maps from detailed ones. The process is composed of a number of elementary operations that have to be applied in accordance with each other in order to achieve optimal results. The difficulty is the correct interplay and sequencing of the operations, which depends on the target scale, the type of objects involved, and the constraints these objects are embedded in (e.g., topological constraints, geometric and semantic context). Generalization is always subjective and requires the expertise of a human cartographer (Spiess 1995). In the digital era, attempts to automate generalization have led to the differentiation between model generalization and cartographic generalization, where the operations of model generalization are considered to be easier to automate than those of cartographic generalization.

After model generalization has been applied, the thematic and geometric granularity of the data set corresponds appropriately to the target scale. However, there might be some geometric conflicts remaining that are caused by applying signatures to the features as well as by imposing minimum distances between adjacent objects. These conflicts have to be solved by cartographic generalization procedures, among which typification and displacement are the most important (for a comprehensive overview, see Mackaness et al. 2007). As opposed to cartographic generalization, model generalization processes have already achieved a high degree of automation. Fully automatic processes are available that are able to generalize large data sets, e.g., the whole of Germany (Urbanke and Dieckhoff 2006).

Scientific Fundamentals

Operations of model generalization are selection, reclassification, aggregation, area collapse, and line simplification.

Selection

According to a given thematic and/or geometric property, objects are selected to be preserved in the target scale. Typical selection criteria are object type, size, or length. Objects fulfilling these criteria are preserved, whereas the others are discarded. In some cases, when an area partitioning of the whole data set has to be preserved, the deleted objects have to be replaced appropriately by neighboring objects.

Re-Classification

Often, the thematic granularity of the target scale is also reduced when reducing the geometric scale. This is realized by reclassification or new classification of object types. For example, in the German ATKIS system, when going from scale 1:25,000 to 1:50,000, the variation of settlement structures is reduced by merging two different settlement types into one class in the target scale.

Area Collapse

When going to smaller scales, higher-dimensional objects may be reduced to lower-dimensional ones. For instance, a city represented as an area is reduced to a point; an areal river is reduced to a linear river object. These reductions can be achieved using skeleton operations. For the area-to-line reduction, the use of the Medial Axis is popular, which is defined as the locus of points that have more than one closest neighbor on the polygon boundary. There are several approximations and special forms of axes (e.g., the Straight Skeleton (Eppstein and Erickson 1998)). Depending on the object and the task at hand, some forms may be more favorable than others (e.g., Chin et al. 1995 and Haunert and Sester 2007).

Aggregation

This is a very important operation that merges two or more objects into a single one, thus leading to a considerable amount of data reduction. Aggregation often follows a selection or area collapse process: when an object is too small (or unimportant) to be presented in the target scale, it has to be merged with a neighboring object. For the selection of the most appropriate neighbor, there are different strategies (see Fig. 1), e.g., selecting the neighbor according to thematic priority rules, the neighbor with the longest common boundary, or the largest neighbor; alternatively, the area can be distributed equally among the neighbors (Haunert and Sester 2007; van Oosterom 1995; Podrenek 2002; van Smaalen 2003). Another criterion is to select a neighbor which leads to a compact aggregated region and to solve the whole problem as a global optimization process (Haunert and Wolff 2006).

Aggregation can also be performed when the objects are not topologically adjacent. Then, appropriate criteria for the determination of the neighborhood are needed, as well as measures to fill the gaps between the neighboring polygons (Bundy et al. 1995). Aggregation can also be applied to other geometric features such as points and lines. This leads to point aggregations that can be approximated by convex hulls, or to aggregations of line features.

Abstraction of Geodatabases, Fig. 1 Different aggregation methods (panels: Definition: dark object -> light gray object; Neighbours: object -> max_neighbors; Size: object -> biggest neighbor; All neighbors: object -> equal distribution to all neighbors)

Line Simplification

Line simplification is a very prominent generalization operation. Many operations have been proposed, mainly taking the relative distance between adjacent points and their relative context into account. The most well-known operator is the Douglas-Peucker algorithm (Douglas and Peucker 1973).

Key Applications

The key application of database abstraction or model generalization is the derivation of less detailed data sets for different applications.

Cartographic Mapping

The production of small-scale maps requires a detailed data set to be reduced in number and granularity of features. This reduction is achieved using database abstraction. It has to be followed by cartographic generalization procedures that are applied in order to generate the final symbolized map without graphical conflicts.

Visualization on Small Displays

The size of mobile display devices requires the presentation of a reduced number of features. To this end, the data can be reduced using data abstraction processes.

Internet Mapping: Streaming Generalization

Visualization of maps on the internet requires the transmission of an appropriate level of detail to the display of the remote user. Database abstraction methods are used to achieve an adequate data reduction that still ensures that the necessary information is communicated to the user. They also allow for the progressive transmission of more and more detailed information (Brenner and Sester 2005; Yang 2005).

Spatial Data Analysis

Spatial analysis functions usually relate to a certain level of detail where the phenomena are best observed; e.g., for planning purposes, a scale of approximately 1:50,000 is very appropriate. Database abstraction can be used to generate this scale from base data sets. The advantage is that the level of detail is reduced while still preserving the geometric accuracy.

Future Directions

MRDB: Multiple Resolution Database

For topographic mapping, data sets of different scales are often provided by Mapping Agencies. In the past, these data sets were typically produced manually by generalization processes. With the availability of automatic generalization tools, such manual effort can be replaced. In order to make additional use of this lattice of data sets, the different scales are stored in a database where the individual objects in the different data sets are connected with explicit links. These links then allow for efficient access to the corresponding objects in the neighboring scales, and thus an ease of movement up and down the different scales. There are several proposals for appropriate MRDB data structures; see, e.g., Balley et al. (2004). The links can be created either in the generalization process or by matching existing data sets (Hampe et al. 2004). Although different approaches already exist, research is still needed to fully exploit this data structure (Sheeren et al. 2004).

Data Update

An MRDB in principle offers the possibility of efficiently keeping the information in linked data sets up to date. The idea is to exploit the link structure and propagate the updated information to the adjacent and linked scales. There are several concepts for this; the challenge, however, is to restrict the influence range to a manageable size (Haunert and Sester 2005).

Cross-References

▶ Generalization, On-the-Fly
▶ Hierarchies and Level of Detail
▶ Map Generalization
▶ Mobile Usage and Adaptive Visualization
▶ Voronoi Diagram
▶ Web Mapping and Web Cartography

References

Balley S, Parent C, Spaccapietra S (2004) Modelling geographic data with multiple representation. Int J Geogr Inf Sci 18(4):327–352
Brenner C, Sester M (2005) Continuous generalization for small mobile displays. In: Agouris P, Croitoru A (eds) Next generation geospatial information. Taylor & Francis, Hoboken, pp 33–41
Bundy G, Jones C, Furse E (1995) Holistic generalization of large-scale cartographic data. In: Müller JC, Lagrange JP, Weibel R (eds) GIS and generalization – methodology and practice. Taylor & Francis, London, pp 106–119
Chin FY, Snoeyink J, Wang CA (1995) Finding the medial axis of a simple polygon in linear time. In: Springer (ed) ISAAC 95: proceedings of the 6th international symposium on algorithms and computation, London, pp 382–391
Douglas D, Peucker T (1973) Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Can Cartogr 10(2):112–122
Eppstein D, Erickson J (1998) Raising roofs, crashing cycles, and playing pool: applications of a data structure for finding pairwise interactions. In: SCG 98: proceedings of the 14th annual symposium on computational geometry, Minneapolis, pp 58–67
Grünreich D (1995) Development of computer-assisted generalization on the basis of cartographic model theory. In: Müller JC, Lagrange JP, Weibel R (eds) GIS and generalization – methodology and practice. Taylor & Francis, London, pp 47–55
Hampe M, Sester M, Harrie L (2004) Multiple representation databases to support visualisation on mobile devices. In: International archives of photogrammetry, remote sensing and spatial information sciences, ISPRS, Istanbul, vol 35
Haunert JH, Sester M (2005) Propagating updates between linked datasets of different scales. In: Proceedings of 22nd international cartographic conference, La Coruna, pp 9–16
Haunert JH, Sester M (2007, in press) Area collapse and road centerlines based on straight skeletons. Geoinformatica
Haunert JH, Wolff A (2006) Generalization of land cover maps by mixed integer programming. In: Proceedings of 14th international symposium on advances in geographic information systems, Arlington
Mackaness WA, Sarjakoski LT, Ruas A (2007) Generalisation of geographic information: cartographic modelling and applications. Published on behalf of the International Cartographic Association by Elsevier, Amsterdam
Müller JC, Lagrange JP, Weibel R (eds) (1995) GIS and generalization – methodology and practice. Taylor & Francis, London
Podrenek M (2002) Aufbau des DLM50 aus dem BasisDLM und Ableitung der DTK50 – Lösungsansatz in Niedersachsen. Kartographische Schriften Band 6. Kirschbaum Verlag, Bonn, pp 126–130
Sheeren D, Mustière S, Zucker JD (2004) Consistency assessment between multiple representations of geographical databases: a specification-based approach. In: Proceedings of the 11th international symposium on spatial data handling, Leicester
Spiess E (1995) The need for generalization in a GIS environment. In: Müller JC, Lagrange JP, Weibel R (eds) GIS and generalization – methodology and practice. Taylor & Francis, London, pp 31–46
Urbanke S, Dieckhoff K (2006) The AdV-project ATKIS generalization, part model generalization (in German). Kartographische Nachrichten 56(4):191–196
van Oosterom P (1995) The gap-tree, an approach to 'on-the-fly' map generalization of an area partitioning. In: Müller JC, Lagrange JP, Weibel R (eds) GIS and generalization – methodology and practice. Taylor & Francis, London, pp 120–132
van Smaalen J (2003) Automated aggregation of geographic objects. A new approach to the conceptual generalisation of geographic databases. PhD thesis, Wageningen University, The Netherlands
Yang B (2005) A multi-resolution model of vector map data for rapid transmission over the internet. Comput Geosci 31(5):569–578


Aggregate Queries, Progressive Approximate

Iosif Lazaridis and Sharad Mehrotra
Department of Computer Science, University of California, Irvine, CA, USA

© Springer International Publishing Switzerland 2016
S. Shekhar et al. (eds.), Encyclopedia of GIS, DOI 10.1007/978-3-319-23519-6_41-2

Synonyms

Approximate aggregate query; On-line aggregation

Definition

Aggregate queries generally take a set of objects as input and produce a single scalar value as output, summarizing one aspect of the set. Commonly used aggregate types include MIN, MAX, AVG, SUM, and COUNT.

If the input set is very large, it might not be feasible to compute the aggregate precisely and in reasonable time. Alternatively, the precise value of the aggregate may not even be needed by the application submitting the query, e.g., if the aggregate value is to be mapped to an 8-bit color code for visualization. This motivates the use of approximate aggregate queries, which return a value close to the exact one, but in a fraction of the time.

Progressive approximate aggregate queries go one step further. They do not produce a single approximate answer, but continuously refine the answer as time goes on, progressively improving its quality. Thus, if the user has a fixed deadline, he can obtain the best answer within the allotted time; conversely, if he has a fixed answer accuracy requirement, the system will use the least amount of time to produce an answer of sufficient accuracy. Progressive approximate aggregate queries are therefore a flexible way of implementing aggregate query answering.

Multi-Resolution Aggregate trees (MRA-trees) are spatial (or, in general, multi-dimensional) indexing data structures whose nodes are augmented with aggregate values for all the indexed subsets of data. They can be used very efficiently to provide an implementation of progressive approximate query answering.

Historical Background

Aggregate queries are extremely useful because they can summarize a huge amount of data by a single number. For example, many users expect to know the average and highest temperature in their city and are not really interested in the temperatures recorded by all the environmental monitoring stations used to produce these numbers. The simplest aggregate query specifies a selection condition identifying the subset of interest, e.g., "all monitoring stations in Irvine", and an aggregate type to be computed, e.g., "MAX temperature".

The normal way to evaluate an aggregate query is to collect all data in the subset of interest and evaluate the aggregate over them. This approach has two problems: first, the user may not need to know that the temperature is 34.12 °C, since 34 ± 0.5 °C will suffice; second, the dataset may be so large that exhaustive computation is infeasible. These observations motivated researchers to devise approximate aggregate query answering mechanisms.

Off-line synopsis-based strategies, such as histograms (Ioannidis and Poosala 1999), samples (Acharya et al. 1999), and wavelets (Chakrabarti et al. 2000), have been proposed for approximate query processing. These use small data summaries that can be processed very easily to answer a query at a small cost. Unfortunately, summaries are inherently unable to adapt to the query requirements. The user usually has no way of knowing how good an approximate answer is and, even if he does, it may not suffice for his goals. Early synopsis-based techniques did not provide any guarantees about the quality of the answer, although this has been incorporated more recently (Garofalakis and Kumar 2005).

Online aggregation (Hellerstein et al. 1997) was proposed to deal with this problem. In online aggregation, the input set is sampled continuously, a process which can, in principle, continue until this set is exhausted, thus providing an answer of arbitrarily good quality; the goal is, however, to use a sample of small size, thus saving on performance while giving a "good enough" answer. A running aggregate is updated progressively, finally converging to the exact answer if the input is exhausted. The sampling usually proceeds one tuple at a time over either the entire data table or a subset of interest; this may be expensive, depending on the size of the table and also its organization: if tuples are physically ordered in some way, then sampling may need to be performed with random disk accesses, which are costlier than sequential accesses.

Multi-resolution trees (Lazaridis and Mehrotra 2001) were designed to deal with the limitations of established synopsis-based techniques and sampling-based online aggregation. Unlike off-line synopses, MRA-trees are flexible and can adapt to the characteristics of the user's quality/time requirements. Their advantage over sampling is that they help queries quickly zero in on the subset of interest without having to process a great number of tuples individually. Moreover, MRA-trees provide deterministic answer quality guarantees to the user that are easy for him to prescribe (when he poses his query) and to interpret (when he receives the results).

Scientific Fundamentals

Multi-dimensional index trees such as R-trees, quad-trees, etc., are used to index data existing in a multi-dimensional domain. Consider a d-dimensional space $\mathbb{R}^d$ and a finite set of points (input relation) $S \subset \mathbb{R}^d$. Typically, for spatial applications, $d \in \{2, 3\}$. The aggregate query is defined as a pair (agg, $R_Q$), where agg is an aggregate function (e.g., MIN, MAX, SUM, AVG, COUNT) and $R_Q \subset \mathbb{R}^d$ is the query region. The query asks for the evaluation of agg over all tuples in $S$ that are in region $R_Q$. Multi-dimensional index trees organize this data via a hierarchical decomposition of the space $\mathbb{R}^d$ or a grouping of the data in $S$. In either case, each node $N$ indexes a set of data tuples contained in its subtree which are guaranteed to have values within the node's region $R_N$.

MRA-trees (Lazaridis and Mehrotra 2001) are a generic technique that can be applied over any standard multi-dimensional index method; they are not yet another indexing technique. They modify the underlying index by adding to each tree node $N$ the value of agg over all data tuples indexed by (i.e., in the sub-tree of) $N$. Only a single such value, e.g., MIN, may be stored, but in general, all aggregate types can be used without much loss of performance. An example of an MRA-quad-tree is seen in Fig. 1.

Aggregate Queries, Progressive Approximate, Fig. 1 Example of an MRA-quad-tree

The key observation behind the use of MRA-trees is that the aggregate value of all the tuples indexed by a node $N$ is known by just visiting $N$. Thus, in addition to the performance benefit of a standard spatial index (visiting only a fraction of selected tuples, rather than the entire set), the MRA-tree also avoids traversing the entire sub-tree of nodes contained within the query region. Nodes that partially overlap the region may or may not contribute to the aggregate, depending on the spatial distribution of points within them. Such nodes can be further explored to improve performance. This situation is seen in Fig. 2: nodes at the perimeter of the query (set $N_p$) can be further explored, whereas nodes at the interior ($N_c$) need not be.

Aggregate Queries, Progressive Approximate, Fig. 2 A snapshot of MRA-tree traversal

The progressive approximation algorithm (Fig. 3) has three major components:

• Computation of a deterministic interval of confidence guaranteed to contain the aggregate value, e.g., [30, 40].
• Estimation of the aggregate value, e.g., 36.2.
• A traversal policy which determines which node to explore next by visiting its children nodes.

Aggregate Queries, Progressive Approximate, Fig. 3 Progressive approximation algorithm

The interval of confidence can be calculated by taking the set of nodes partially overlapping/contained in the query into account (Fig. 2). The details of this for all the aggregate types can be found in Lazaridis and Mehrotra (2001). For example, if the SUM of all contained nodes is 50 and the SUM of all partially overlapping nodes is 15, then the interval is [50, 65], since all the tuples in the overlapping nodes could be either outside or inside the query region.

There is no single best way for aggregate value estimation. For example, taking the middle of the interval has the advantage of minimizing the worst-case error. On the other hand, intuitively, if a node barely overlaps with the query, then it is expected that its overall contribution to the query will be slight. Thus, if in a variant of the previous example (contained SUM of 50) there are two partially overlapping nodes, A and B, with SUM(A) = 5 and SUM(B) = 15, so that the interval is [50, 70], and 30% of A and 50% of B overlap with the query, respectively, then a good estimate of the SUM aggregate will be $50 + 5 \times 0.3 + 15 \times 0.5 = 59$.

Finally, the traversal policy should aim to shrink the interval of confidence by the greatest amount, thus improving the accuracy of the answer as fast as possible. This is achieved by organizing the partially overlapping nodes using a priority queue. The queue is initialized with the root node; subsequently, the front node of the queue is repeatedly picked, its children are examined, the confidence interval and aggregate estimate are updated, and the partially overlapping children are placed in the queue. In our example, the policy would prefer to explore node B before A, since B contributes more (15) to the uncertainty inherent in the interval of confidence than A (5). Detailed descriptions of the priority used for the different aggregate types can be found in Lazaridis and Mehrotra (2001).

Performance of MRA-trees depends on the underlying data structure used as well as on the aggregate type and query selectivity. MIN and MAX queries are typically evaluated very efficiently, since the query processing system uses the node aggregates to quickly zero in on a few candidate nodes that contain the minimum (or maximum) value; very rarely is the entire perimeter needed to compute even the exact answer. Query selectivity affects processing speed; as with all multi-dimensional indexes, performance degrades as a higher fraction of the input table $S$ is selected. However, unlike traditional indexes, the degradation is more gradual since the "interior" area of the query region is not explored. A typical profile of answer error as a function of the number of nodes visited can be seen in Fig. 4.

Aggregate Queries, Progressive Approximate, Fig. 4 Answer error improves as more MRA-tree nodes are visited (plot title: Relative Error (COUNT, 25%); x-axis: # MRA-tree Nodes Visited; y-axis: Average Relative Error)

MRA-trees use extra space (to store the aggregates) in exchange for time. If the underlying data structure is an R-tree, then storage of aggregates in tree nodes results in decreased fanout, since fewer bounding rectangles and their accompanying aggregate values can be stored within a disk page. Decreased fanout may imply increased height of the tree. Fortunately, the overhead of aggregate storage does not negatively affect performance, since it is counter-balanced by the benefits of partial tree exploration. Thus, even for computing the exact answer, MRA-trees are usually faster than regular R-trees, and the difference grows further if a small error, e.g., on the order of 10%, is allowed (Fig. 5).

Key Applications

Progressive approximate aggregate queries using a multi-resolution tree structure can be used in many application domains when data is either large, difficult to process, or the exact answer is not needed.
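To make the mechanism behind these applications concrete, the three components described under Scientific Fundamentals (confidence interval, estimate, and priority-queue traversal) can be sketched for a SUM query. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Node`, `progressive_sum`, `contains`, `intersects`, and `overlap_fraction` are hypothetical, node regions are axis-aligned rectangles, all values are assumed non-negative, leaves store their points directly, and the priority is simply the stored SUM of a partially overlapping node.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count
from typing import Iterator, List, Tuple

Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)
Point = Tuple[float, float, float]         # (x, y, value); value >= 0 assumed


@dataclass
class Node:
    rect: Rect
    total: float                                         # stored SUM of the subtree
    children: List["Node"] = field(default_factory=list)
    points: List[Point] = field(default_factory=list)    # payload of a leaf


def contains(outer: Rect, inner: Rect) -> bool:
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def overlap_fraction(node: Rect, query: Rect) -> float:
    """Share of the node's area that lies inside the query region."""
    w = min(node[2], query[2]) - max(node[0], query[0])
    h = min(node[3], query[3]) - max(node[1], query[1])
    area = (node[2] - node[0]) * (node[3] - node[1])
    return (w * h) / area if w > 0 and h > 0 and area > 0 else 0.0


def progressive_sum(root: Node, query: Rect) -> Iterator[Tuple[float, float, float]]:
    """Yield (lower, upper, estimate) after each refinement step."""
    certain = 0.0            # SUM known to lie inside the query
    frontier: list = []      # max-heap of partially overlapping nodes
    tie = count()            # tie-breaker so Node objects are never compared

    def classify(node: Node) -> None:
        nonlocal certain
        if contains(query, node.rect):
            certain += node.total            # fully contained: no need to descend
        elif intersects(node.rect, query):
            heapq.heappush(frontier, (-node.total, next(tie), node))
        # disjoint nodes contribute nothing

    def snapshot() -> Tuple[float, float, float]:
        upper = certain + sum(n.total for _, _, n in frontier)
        estimate = certain + sum(n.total * overlap_fraction(n.rect, query)
                                 for _, _, n in frontier)
        return certain, upper, estimate

    classify(root)
    yield snapshot()
    while frontier:
        _, _, node = heapq.heappop(frontier)   # largest remaining uncertainty first
        if node.children:
            for child in node.children:
                classify(child)
        else:                                   # leaf: resolve it exactly
            certain += sum(v for x, y, v in node.points
                           if query[0] <= x <= query[2] and query[1] <= y <= query[3])
        yield snapshot()
```

A caller with a deadline can stop consuming the generator at any point and keep the last snapshot, while a caller with an accuracy requirement can stop as soon as `upper - lower` is small enough. For a root whose children are one contained node with SUM 50 and two partially overlapping nodes with SUM 5 (30% overlap) and SUM 15 (50% overlap), the snapshot taken after expanding the root should reproduce the interval [50, 70] and the estimate 59 from the worked example.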