Lecture Notes in Computer Science 3484
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, New York University, NY, USA
Doug Tygar, University of California, Berkeley, CA, USA
Moshe Y. Vardi, Rice University, Houston, TX, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany

Evripidis Bampis, Klaus Jansen, Claire Kenyon (Eds.)
Efficient Approximation and Online Algorithms
Recent Progress on Classical Combinatorial Optimization Problems and New Applications

Volume Editors
Evripidis Bampis, Université d’Évry Val d’Essonne, LaMI, CNRS UMR 8042, 523, Place des Terasses, Tour Evry 2, 91000 Evry Cedex, France. E-mail: [email protected]
Klaus Jansen, University of Kiel, Institute for Computer Science and Applied Mathematics, Olshausenstr. 40, 24098 Kiel, Germany. E-mail: [email protected]
Claire Kenyon, Brown University, Department of Computer Science, Box 1910, Providence, RI 02912, USA. E-mail: [email protected]

Library of Congress Control Number: 2006920093
CR Subject Classification (1998): F.2, C.2, G.2-3, I.3.5, G.1.6, E.5
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-540-32212-4 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-32212-2 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springeronline.com
© Springer-Verlag Berlin Heidelberg 2006
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper SPIN: 11671541 06/3142 543210

Preface

In this book, we present some recent advances in the field of combinatorial optimization, focusing on the design of efficient approximation and on-line algorithms. Combinatorial optimization and polynomial time approximation are very closely related: given an NP-hard combinatorial optimization problem, i.e., a problem for which no polynomial time algorithm exists unless P = NP, one important approach used by computer scientists is to consider polynomial time algorithms that do not produce optimum solutions, but solutions that are provably close to the optimum. A natural partition of combinatorial optimization problems into two classes is then of both practical and theoretical interest: the problems that are fully approximable, i.e., those for which there is an approximation algorithm that can approach the optimum with any arbitrary precision in terms of relative error, and the problems that are partly approximable, i.e., those for which it is possible to approach the optimum only up to a fixed factor unless P = NP. For some of these problems, especially those that are motivated by practical applications, the input may not be completely known in advance, but revealed over time. In this case, known as the on-line case, the goal is to design algorithms that are able to produce solutions that are close to the best possible solution that can be produced by any off-line algorithm, i.e., an algorithm that knows the input in advance.
These issues have been treated in some recent texts¹, but in the last few years a large number of new results have been produced in the area of approximation and on-line algorithms. This book is devoted to the study of some classical problems of scheduling, of packing, and of graph theory, but also of new optimization problems arising in various applications such as networks, data mining, or classification. One central idea in the book is the use of linear programming relaxations of the problem together with randomization and rounding techniques. The book is divided into 11 chapters. The chapters are self-contained and may be read in any order.

¹ V. Vazirani, Approximation Algorithms, Springer-Verlag, Berlin, 2001; G. Ausiello et al., Complexity and Approximation: Combinatorial Optimization Problems and Their Approximability, Springer-Verlag, 1999; D. S. Hochbaum, editor, Approximation Algorithms for NP-Hard Problems, PWS Publishing Company, 1997; A. Borodin and R. El-Yaniv, On-line Computation and Competitive Analysis, Cambridge University Press, 1998; A. Fiat and G. J. Woeginger, editors, Online Algorithms: The State of the Art, LNCS 1442, Springer-Verlag, Berlin, 1998.

Chap. 1 introduces a theoretical framework for dealing with data mining applications. Some of the most studied problems in this area as well as algorithmic tools are presented. Chap. 2 presents a survey concerning local search and approximation. Local search has been widely used in the core of many heuristic algorithms and produces excellent practical results for many combinatorial optimization problems. The objective here is to compare, from a theoretical point of view, the quality of local optimum solutions with respect to a global optimum solution using the notion of the approximation factor, and to review the most important results in this direction. Chap. 3 surveys the wavelength routing problem in the case where the underlying optical network is a tree.
The goal is to establish the requested communication connections using the smallest total number of wavelengths. In the case of trees, this problem reduces to finding a set of transmitter-receiver paths and assigning a wavelength to each path so that no two paths of the same wavelength share the same fiber link. Approximation and on-line algorithms, as well as hardness results and lower bounds, are presented. In Chap. 4, a call admission control problem is considered in which the objective is the maximization of the number of accepted communication requests. This problem is formalized as an edge-disjoint-path problem in (non)-oriented graphs, and the most important (non)-approximability results, for arbitrary graphs as well as for some particular graph classes, are presented. Furthermore, combinatorial and linear programming algorithms are reviewed for a generalization of the problem, the unsplittable flow problem. Chap. 5 is focused on a special class of graphs, the intersection graphs of disks. Approximation and on-line algorithms are presented for the maximum independent set and coloring problems in this class. In Chap. 6, a general technique for solving min-max and max-min resource sharing problems is presented and applied to two applications: scheduling unrelated machines and strip packing. In Chap. 7, a simple analysis is proposed for the on-line problem of scheduling preemptively a set of tasks in a multiprocessor setting in order to minimize the flow time (total time of the tasks in the system). In Chap. 8, approximation results are presented for a general classification problem, the labeling problem, which arises in several contexts and aims to classify related objects by assigning to each of them one label. In Chap.
9, a very efficient tool for designing approximation algorithms for scheduling problems is presented, list scheduling in order of α-points, and it is illustrated for the single machine problem where the objective function is the sum of weighted completion times. Chap. 10 is devoted to the study of one classical optimization problem, the k-median problem, from the approximation point of view. The main algorithmic approaches existing in the literature as well as the hardness results are presented. Chap. 11 focuses on a powerful tool for the analysis of randomized approximation algorithms, the Lovász-Local-Lemma, which is illustrated in two applications: the job shop scheduling problem and resource-constrained scheduling.

We take the opportunity to thank all the authors and the reviewers for their important contribution to this book. We gratefully acknowledge the support from the EU Thematic Network APPOL I+II (Approximation and Online Algorithms). We also thank Ute Iaquinto and Parvaneh Karimi Massouleh from the University of Kiel for their help.

September 2005
Evripidis Bampis, Klaus Jansen, and Claire Kenyon

Table of Contents

Contributed Talks

On Approximation Algorithms for Data Mining Applications
  Foto N. Afrati
A Survey of Approximation Results for Local Search Algorithms
  Eric Angel
Approximation Algorithms for Path Coloring in Trees
  Ioannis Caragiannis, Christos Kaklamanis, Giuseppe Persiano
Approximation Algorithms for Edge-Disjoint Paths and Unsplittable Flow
  Thomas Erlebach
Independence and Coloring Problems on Intersection Graphs of Disks
  Thomas Erlebach, Jiří Fiala
Approximation Algorithms for Min-Max and Max-Min Resource Sharing Problems, and Applications
  Klaus Jansen
A Simpler Proof of Preemptive Total Flow Time Approximation on Parallel Machines
  Stefano Leonardi
Approximating a Class of Classification Problems
  Ioannis Milis
List Scheduling in Order of α-Points on a Single Machine
  Martin Skutella
Approximation Algorithms for the k-Median Problem
  Roberto Solis-Oba
The Lovász-Local-Lemma and Scheduling
  Anand Srivastav
Author Index

On Approximation Algorithms for Data Mining Applications
Foto N. Afrati
National Technical University of Athens, Greece

Abstract. We aim to present current trends in the theoretical computer science research on topics which have applications in data mining. We briefly describe data mining tasks in various application contexts. We give an overview of some of the questions and algorithmic issues that are of concern when mining huge amounts of data that do not fit in main memory.

1 Introduction

Data mining is about extracting useful information from massive data, such as finding frequently occurring patterns, finding similar regions, or clustering the data. The advent of the internet has added new applications and challenges to this area. From the algorithmic point of view, mining algorithms seek to compute good approximate solutions to the problem at hand. As a consequence of the huge size of the input, algorithms are usually restricted to making only a few passes over the data, and they have limitations on the random access memory they use and the time spent per data item.

The input in a data mining task can be viewed, in most cases, as a two-dimensional $m \times n$ 0,1-matrix which often is sparse.
This matrix may represent several kinds of objects, such as a collection of documents (each row is a document, each column is a word, and there is a 1 entry if the word appears in the document), a collection of retail records (each row is a transaction record, each column represents an item, and there is a 1 entry if the item was bought in the transaction), or the web (both rows and columns are sites, and there is a 1 entry if there is a link from one site to the other). In the latter case, the matrix is often viewed as a graph too. Sometimes the matrix can be viewed as a sequence of vectors (its rows) or even a sequence of vectors with integer values (not only 0,1).

The performance of a data mining algorithm is measured in terms of the number of passes, the required workspace in main memory, and the computation time per data item. A constant number of passes is acceptable, but one-pass algorithms are the most sought after. The workspace available is ideally constant, but sublinear-space algorithms are also considered. The quality of the output is usually measured using conventional approximation ratio measures [97], although in some problems the notion of approximation and the manner of evaluating the results remain to be further investigated.

E. Bampis et al. (Eds.): Approximation and Online Algorithms, LNCS 3484, pp. 1–29, 2006. © Springer-Verlag Berlin Heidelberg 2006

These performance constraints call for designing novel techniques and novel computational paradigms. Since the amount of data far exceeds the amount of workspace available to the algorithm, it is not possible for the algorithm to “remember” large amounts of past data. A recent approach is to create a summary of the past data to store in main memory, leaving enough memory for the processing of the future data. Using a random sample of the data is another popular technique.
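As an illustration of keeping a random sample in a single pass, the following Python sketch uses reservoir sampling. Reservoir sampling is not named in the chapter; it is swapped in here only as one well-known way to maintain a uniform sample of fixed size while reading each item once. The function name `reservoir_sample` and its parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable.

    Illustrative sketch (not from the chapter): reads the stream once and
    stores at most k items, so the workspace does not grow with the input.
    """
    if rng is None:
        rng = random.Random()
    sample = []
    for t, item in enumerate(stream):
        if t < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item number t+1 replaces a stored item with probability k/(t+1).
            j = rng.randint(0, t)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


if __name__ == "__main__":
    rows = range(1_000_000)  # stands in for rows arriving one at a time
    print(reservoir_sample(rows, 5, random.Random(0)))
```

A "conventional" algorithm can then be run on the returned sample, which fits in main memory regardless of the length of the stream.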
Besides data mining, other applications can also be modeled as one-pass problems, such as the interface between the storage manager and the application layer of a database system, or processing data that are brought to the desktop from networks, where each pass is essentially another expensive access to the network. Several communities have contributed (with technical tools and methods as well as by solving similar problems) to the evolution of the data mining field, including statistics, machine learning, and databases.

Many single-pass algorithms have been developed recently, along with techniques and tools that facilitate them. We will review some of them here. In the first part of this chapter (next two sections), we review formalisms and technical tools used to find solutions to problems in this area. In the rest of the chapter we briefly discuss recent research in association rules, clustering, and web mining. An association rule relates two columns of the entry matrix (e.g., if the i-th entry of a row v is 1 then most probably the j-th entry of v is also 1). Clustering the rows of the matrix according to various similarity criteria in a single pass is a new challenge which traditional clustering algorithms did not face. In web mining, one problem of interest in search engines is to rank the pages of the web according to their importance on a topic. Popular search engines use citation importance, under which important pages are assumed to be those that are linked to by other important pages.

In more detail, the rest of the chapter is organized as follows. The next section contains formal techniques used for single-pass algorithms and a formalism for the data stream model. Section 3 contains an algorithm with performance guarantees for approximately finding the $L_p$ distance between two data streams. As an example, Section 4 contains a list of what are considered the main data mining tasks and another list with applications of these tasks.
The last three sections discuss recent algorithms developed for finding association rules, clustering a set of data items, and searching the web for useful information. In these three sections, techniques mentioned in the beginning of the chapter (such as SVD and sampling) are used to solve the specific problems. Naturally, some of the techniques are common; for example, spectral methods are used in both clustering and web mining. As the area is rapidly evolving, this chapter serves as a brief introduction to the most popular technical tools and applications.

2 Formal Techniques and Tools

In this section we present some theoretical results and formalisms that are often used in developing algorithms for data mining applications. In this context, the singular value decomposition (SVD) of a matrix (subsection 2.1) has inspired web search techniques and, as a dimensionality reduction technique, is used for finding similarities among documents or clustering documents (known as the latent semantic indexing technique for document analysis). Random projections (subsection 2.1) offer another means for dimensionality reduction explored in recent work. Data streams (subsection 2.2) are proposed for modeling limited-pass algorithms; that subsection also discusses lower and upper bounds on the required workspace. Sampling techniques (subsection 2.3) have also been used in statistics and learning theory, though from a somewhat different perspective. Storing a sample of the data that fits in main memory and running a “conventional” algorithm on this sample is often used as the first stage of various data mining algorithms. We present a computational model for probabilistic sampling algorithms that compute approximate solutions. This model is based on the decision tree model [27] and relates the query complexity to the size of the sample.

We start by providing some (mostly) textbook definitions for self-containment purposes. In data mining we are interested in vectors and their relationships under several distance measures.
For two vectors $\vec{v} = (v_1, \ldots, v_n)$ and $\vec{u} = (u_1, \ldots, u_n)$, the dot product or inner product is defined to be the number equal to the sum of the component-wise products, $\vec{v} \cdot \vec{u} = v_1 u_1 + \ldots + v_n u_n$, and the $L_p$ distance (or $L_p$ norm) is defined to be $\|\vec{v} - \vec{u}\|_p = (\sum_{i=1}^{n} |v_i - u_i|^p)^{1/p}$. For $p = \infty$, the $L_\infty$ distance is equal to $\max_{i=1}^{n} |u_i - v_i|$. The $L_p$ distance is extended to be defined between matrices: $\|\vec{V} - \vec{U}\|_p = (\sum_i (\sum_j |V_{ij} - U_{ij}|^p))^{1/p}$. We sometimes use $\|\cdot\|$ to denote $\|\cdot\|_2$. The cosine distance is defined to be $1 - \frac{\vec{v} \cdot \vec{u}}{\|\vec{v}\| \, \|\vec{u}\|}$. For sparse matrices the cosine distance is a suitable similarity measure, as the dot product deals only with non-zero entries (which are the entries that contain the information) and is then normalized over the lengths of the vectors.

Some results are based on stable distributions [85]. A distribution $D$ over the reals is called p-stable if, for any $n$ real numbers $a_1, \ldots, a_n$ and independent identically distributed variables $X_1, \ldots, X_n$ with distribution $D$, the random variable $\sum_i a_i X_i$ has the same distribution as the variable $(\sum_i |a_i|^p)^{1/p} X$, where $X$ is a random variable with the same distribution as the variables $X_1, \ldots, X_n$. It is known that stable distributions exist for any $p \in (0, 2]$. A Cauchy distribution, defined by the density function $\frac{1}{\pi (1 + x^2)}$, is 1-stable; a Gaussian (normal) distribution, defined by the density function $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, is 2-stable.

A randomized algorithm [81] is an algorithm that flips coins, i.e., it uses random bits, while no probabilistic assumption is made on the distribution of the input. A randomized algorithm is called Las-Vegas if it gives the correct answer on all inputs. Its running time or workspace could be a random variable depending on the random variable of the coin tosses. A randomized algorithm is called Monte-Carlo with error probability $\epsilon$ if on every input it gives the right answer with probability at least $1 - \epsilon$.
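The distance measures defined earlier in this section can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `dot`, `lp_distance`, and `cosine_distance` are hypothetical, and the vectors are assumed to be plain lists of equal length with non-zero length for the cosine distance.

```python
import math


def dot(v, u):
    """Dot product: sum of the component-wise products."""
    return sum(a * b for a, b in zip(v, u))


def lp_distance(v, u, p):
    """L_p distance between v and u; p = math.inf gives the L_infinity distance."""
    if math.isinf(p):
        return max(abs(a - b) for a, b in zip(v, u))
    return sum(abs(a - b) ** p for a, b in zip(v, u)) ** (1.0 / p)


def cosine_distance(v, u):
    """1 minus the dot product normalized by the L_2 lengths of both vectors."""
    return 1.0 - dot(v, u) / (math.sqrt(dot(v, v)) * math.sqrt(dot(u, u)))


if __name__ == "__main__":
    doc_a = [1, 0, 1, 1, 0]  # rows of a 0,1-matrix, e.g. word occurrences
    doc_b = [1, 1, 0, 1, 0]
    print(lp_distance(doc_a, doc_b, 1))
    print(lp_distance(doc_a, doc_b, 2))
    print(lp_distance(doc_a, doc_b, math.inf))
    print(cosine_distance(doc_a, doc_b))
```

For sparse rows, a practical implementation would iterate only over the non-zero entries; the dense lists here are a simplification.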
2.1 Dimensionality Reduction

Given a set $S$ of points in the multidimensional space, dimensionality reduction techniques are used to map $S$ to a set $S'$ of points in a space of much smaller dimensionality while approximately preserving important properties of the points in $S$. Usually we want to preserve distances. Dimensionality reduction techniques can be useful in many problems where distance computations and comparisons are needed. In high dimensions distance computations are very slow and, moreover, it is known that, in this case, the distance between almost all pairs of points is the same with high probability and almost all pairs of points are orthogonal (known as the Curse of Dimensionality).

Recently popular dimensionality reduction techniques include Random Projections and Singular Value Decomposition (SVD). Other dimensionality reduction techniques use linear transformations such as the Discrete Cosine Transform, Haar Wavelet coefficients, or the Discrete Fourier Transform (DFT). DFT is a heuristic based on the observation that, for many sequences, most of the energy of the signal is concentrated in the first few components of the DFT. The $L_2$ distance is preserved exactly under the DFT, and its implementation is also practically efficient due to an $O(n \log n)$ DFT algorithm.

Dimensionality reduction techniques are well explored in databases [51,43].

Random Projections. Random Projection techniques are based on the Johnson-Lindenstrauss (JL) lemma [67], which states that any set of $n$ points can be embedded into the k-dimensional space with $k = O(\log n / \epsilon^2)$ so that the distances are preserved within a factor of $1 + \epsilon$.

Lemma 1. (JL) Let $\vec{v}_1, \ldots, \vec{v}_m$ be a sequence of points in the d-dimensional space over the reals and let $\epsilon, F \in (0, 1]$.
Then there exists a linear mapping $f$ from the points of the d-dimensional space into the points of the k-dimensional space, where $k = O(\log(1/F) / \epsilon^2)$, such that the number of vectors which approximately preserve their length is at least $(1 - F) m$. We say that a vector $\vec{v}_i$ approximately preserves its length if:

$\|\vec{v}_i\|^2 \le \|f(\vec{v}_i)\|^2 \le (1 + \epsilon) \|\vec{v}_i\|^2$

The proof of the lemma, however, is non-constructive: it shows that a random mapping induces small distortions with high probability. Several versions of the proof exist in the literature. We sketch the proof from [65]. Since the mapping is linear, we can assume without loss of generality that the $\vec{v}_i$'s are unit vectors. The linear mapping $f$ is given by a $k \times d$ matrix $\vec{A}$, and $f(\vec{v}_i) = \vec{A} \vec{v}_i$, $i = 1, \ldots, m$. If the matrix $\vec{A}$ is chosen at random such that each of its coordinates is chosen independently from $N(0, 1)$, then each coordinate of $f(\vec{v}_i)$ is also distributed according to $N(0, 1)$ (this is a consequence of the spherical symmetry of the normal distribution). Therefore, for any vector $\vec{v}$ and for each $j = 1, \ldots, k/2$, the sum of squares of consecutive coordinates $Y_j = \|f(\vec{v})_{2j-1}\|^2 + \|f(\vec{v})_{2j}\|^2$ has exponential distribution with exponent 1/2. The expectation of $L = \|f(\vec{v})\|^2$ is equal to $\sum_j E[Y_j] = k$. It can be shown that the value of $L$ lies within $\epsilon$ of its mean with probability $1 - F$. Thus the expected number of vectors whose length is approximately preserved is $(1 - F) m$.

The JL lemma has proven useful in substantially improving many approximation algorithms (e.g., [65,17]). Recently in [40], a deterministic algorithm
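The random mapping used in the proof sketch above can be illustrated with a short Python example: a $k \times d$ matrix with independent $N(0,1)$ entries applied to a vector. This is a sketch under stated assumptions, not the construction of [65] in full. The function names, the seeded generator, and the scaling of the entries by $1/\sqrt{k}$ are assumptions added for illustration; the scaling makes the expected squared length of the image equal to the squared length of the original vector rather than $k$ times it.

```python
import math
import random


def random_projection_matrix(k, d, rng):
    """k x d matrix of independent N(0,1) entries, scaled by 1/sqrt(k).

    The scaling is an assumption of this sketch: it normalizes the
    expected squared length of a projected vector to the original one.
    """
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, v):
    """Apply the linear mapping f(v) = A v."""
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]


def squared_norm(v):
    """Squared L_2 length of v."""
    return sum(x * x for x in v)


if __name__ == "__main__":
    rng = random.Random(0)
    d, k = 1000, 200
    A = random_projection_matrix(k, d, rng)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    ratio = squared_norm(project(A, v)) / squared_norm(v)
    print(f"squared-length ratio after projection: {ratio:.3f}")
```

Larger values of `k` concentrate the ratio more tightly around 1, which is the behavior the lemma quantifies.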