Lecture Notes in Computer Science 3484
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
University of Dortmund, Germany
Madhu Sudan
Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos
New York University, NY, USA
Doug Tygar
University of California, Berkeley, CA, USA
Moshe Y. Vardi
Rice University, Houston, TX, USA
Gerhard Weikum
Max-Planck Institute of Computer Science, Saarbruecken, Germany
Evripidis Bampis
Klaus Jansen
Claire Kenyon (Eds.)
Efficient Approximation
and Online Algorithms
Recent Progress on Classical Combinatorial
Optimization Problems and New Applications
Volume Editors
Evripidis Bampis
Université d'Évry Val d'Essonne
LaMI, CNRS UMR 8042
523, Place des Terasses, Tour Evry 2, 91000 Evry Cedex, France
E-mail: bampis@lami.univ-evry.fr
Klaus Jansen
University of Kiel
Institute for Computer Science and Applied Mathematics
Olshausenstr. 40, 24098 Kiel, Germany
E-mail: kj@informatik.uni-kiel.de
Claire Kenyon
Brown University
Department of Computer Science
Box 1910, Providence, RI 02912, USA
E-mail: claire@cs.brown.edu
Library of Congress Control Number: 2006920093
CR Subject Classification (1998): F.2, C.2, G.2-3, I.3.5, G.1.6, E.5
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-540-32212-4 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-32212-2 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springeronline.com
© Springer-Verlag Berlin Heidelberg 2006
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper    SPIN: 11671541    06/3142
Preface
In this book, we present some recent advances in the field of combinatorial optimization, focusing on the design of efficient approximation and on-line algorithms. Combinatorial optimization and polynomial time approximation are very closely related: given an NP-hard combinatorial optimization problem, i.e., a problem for which no polynomial time algorithm exists unless P = NP, one important approach used by computer scientists is to consider polynomial time algorithms that do not produce optimum solutions, but solutions that are provably close to the optimum. A natural partition of combinatorial optimization problems into two classes is then of both practical and theoretical interest: the problems that are fully approximable, i.e., those for which there is an approximation algorithm that can approach the optimum with arbitrary precision in terms of relative error, and the problems that are partly approximable, i.e., those for which it is possible to approach the optimum only up to a fixed factor unless P = NP. For some of these problems, especially those that are motivated by practical applications, the input may not be completely known in advance, but revealed over time. In this case, known as the on-line case, the goal is to design algorithms that are able to produce solutions that are close to the best possible solution that can be produced by any off-line algorithm, i.e., an algorithm that knows the input in advance.
These issues have been treated in some recent texts1, but in the last few years a large number of new results have been produced in the area of approximation and on-line algorithms. This book is devoted to the study of some classical problems of scheduling, of packing, and of graph theory, but also of new optimization problems arising in various applications such as networks, data mining or classification. One central idea in the book is to use a linear programming relaxation of the problem, randomization and rounding techniques.

1 V. Vazirani, Approximation Algorithms, Springer-Verlag, Berlin, 2001; G. Ausiello et al., Complexity and Approximation: Combinatorial Optimization Problems and Their Approximability, Springer-Verlag, 1999; D. S. Hochbaum, editor, Approximation Algorithms for NP-Hard Problems, PWS Publishing Company, 1997; A. Borodin, R. El-Yaniv, Online Computation and Competitive Analysis, Cambridge University Press, 1998; A. Fiat and G. J. Woeginger, editors, Online Algorithms: The State of the Art, LNCS 1442, Springer-Verlag, Berlin, 1998.

The book is divided into 11 chapters. The chapters are self-contained and may be read in any order.

In Chap. 1, the goal is the introduction of a theoretical framework for dealing with data mining applications. Some of the most studied problems in this area as well as algorithmic tools are presented. Chap. 2 presents a survey concerning local search and approximation. Local search has been widely used in the core of many heuristic algorithms and produces excellent practical results for many combinatorial optimization problems. The objective here is to compare, from a theoretical point of view, the quality of local optimum solutions with respect to a global optimum solution, using the notion of the approximation factor, and to review the most important results in this direction. Chap. 3
surveys the wavelength routing problem in the case where the underlying optical network is a tree. The goal is to establish the requested communication connections using the smallest total number of wavelengths. In the case of trees this problem reduces to the problem of finding a set of transmitter-receiver paths and assigning a wavelength to each path so that no two paths of the same wavelength share the same fiber link. Approximation and on-line algorithms, as well as hardness results and lower bounds, are presented. In Chap. 4, a call admission control problem is considered in which the objective is the maximization of the number of accepted communication requests. This problem is formalized as an edge-disjoint-paths problem in (non-)oriented graphs and the most important (non-)approximability results, for arbitrary graphs as well as for some particular graph classes, are presented. Furthermore, combinatorial and linear programming algorithms are reviewed for a generalization of the problem, the unsplittable flow problem. Chap. 5 is focused on a special class of graphs, the intersection graphs of disks. Approximation and on-line algorithms are presented for the maximum independent set and coloring problems in this class. In Chap. 6, a general technique for solving min-max and max-min resource sharing problems is presented and it is applied to two applications: scheduling unrelated machines and strip packing. In Chap. 7, a simple analysis is proposed for the on-line problem of scheduling preemptively a set of tasks in a multiprocessor setting in order to minimize the flow time (total time of the tasks in the system). In Chap. 8, approximation results are presented for a general classification problem, the labeling problem, which arises in several contexts and aims to classify related objects by assigning one label to each of them. In Chap. 9, a very efficient tool for designing approximation algorithms for scheduling problems is presented, list scheduling in order of α-points, and it is illustrated for the single machine problem where the objective function is the sum of weighted completion times. Chap. 10 is devoted to the study of one classical optimization problem, the k-median problem, from the approximation point of view. The main algorithmic approaches existing in the literature as well as the hardness results are presented. Chap. 11 focuses on a powerful tool for the analysis of randomized approximation algorithms, the Lovász Local Lemma, which is illustrated in two applications: the job shop scheduling problem and resource-constrained scheduling.
We take the opportunity to thank all the authors and the reviewers for their
important contribution to this book. We gratefully acknowledge the support
from the EU Thematic Network APPOL I+II (Approximation and Online Al-
gorithms). We also thank Ute Iaquinto and Parvaneh Karimi Massouleh from
the University of Kiel for their help.
September 2005 Evripidis Bampis, Klaus Jansen, and Claire Kenyon
Table of Contents
Contributed Talks
On Approximation Algorithms for Data Mining Applications
Foto N. Afrati ................................................ 1
A Survey of Approximation Results for Local Search Algorithms
Eric Angel ................................................... 30
Approximation Algorithms for Path Coloring in Trees
Ioannis Caragiannis, Christos Kaklamanis, Giuseppe Persiano...... 74
Approximation Algorithms for Edge-Disjoint Paths and Unsplittable
Flow
Thomas Erlebach ............................................. 97
Independence and Coloring Problems on Intersection Graphs of Disks
Thomas Erlebach, Jiří Fiala .................................... 135
Approximation Algorithms for Min-Max and Max-Min Resource
Sharing Problems, and Applications
Klaus Jansen................................................. 156
A Simpler Proof of Preemptive Total Flow Time Approximation on
Parallel Machines
Stefano Leonardi.............................................. 203
Approximating a Class of Classification Problems
Ioannis Milis ................................................. 213
List Scheduling in Order of α-Points on a Single Machine
Martin Skutella............................................... 250
Approximation Algorithms for the k-Median Problem
Roberto Solis-Oba ............................................. 292
The Lovász Local Lemma and Scheduling
Anand Srivastav .............................................. 321
Author Index.................................................. 349
On Approximation Algorithms for Data
Mining Applications
Foto N. Afrati
National Technical University of Athens, Greece
Abstract. We aim to present current trends in theoretical computer science research on topics which have applications in data mining. We briefly describe data mining tasks in various application contexts. We give an overview of some of the questions and algorithmic issues that are of concern when mining huge amounts of data that do not fit in main memory.
1 Introduction
Data mining is about extracting useful information from massive data such as
finding frequently occurring patterns or finding similar regions or clustering the
data. The advent of the internet has added new applications and challenges to
this area. From the algorithmic point of view, mining algorithms seek to compute
good approximate solutions to the problem at hand. As a consequence of the
huge size of the input, algorithms are usually restricted to making only a few
passes over the data, and they have limitations on the random access memory
they use and the time spent per data item.
The input in a data mining task can be viewed, in most cases, as a two di-
mensional m×n 0,1-matrix which often is sparse. This matrix may represent
several objects such as a collection of documents (each row is a document and
each column is a word and there is a 1 entry if the word appears in this doc-
ument), or a collection of retail records (each row is a transaction record and
each column represents an item, there is a 1 entry if the item was bought in
this transaction), or both rows and columns are sites on the web and there is a
1 entry if there is a link from one site to the other. In the latter case, the
matrix is often viewed as a graph too. Sometimes the matrix can be viewed as a
sequence of vectors (its rows) or even a sequence of vectors with integer values
(not only 0,1).
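To make the matrix view above concrete, the following small sketch (an illustration added here, not taken from the chapter) builds the 0,1 document-word matrix for three toy documents, storing each row sparsely as the set of columns that hold a 1.

```python
# A minimal sketch of the 0/1 document-term matrix described above: rows are
# documents, columns are words, and entry (i, j) is 1 iff word j appears in
# document i. The documents below are made up for illustration.
docs = [
    "approximation algorithms for data mining",
    "online algorithms for scheduling",
    "data mining and clustering",
]
vocab = sorted({w for d in docs for w in d.split()})
col = {w: j for j, w in enumerate(vocab)}

# sparse row representation: the set of columns holding a 1
rows = [{col[w] for w in d.split()} for d in docs]

# dense 0/1 view of the same m x n matrix
matrix = [[1 if j in r else 0 for j in range(len(vocab))] for r in rows]
for r in matrix:
    print(r)
```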
The performance of a data mining algorithm is measured in terms of the
number of passes, the required work space in main memory and computation
time per data item. A constant number of passes is acceptable but one pass al-
gorithms are mostly sought. The workspace available is ideally constant, but sublinear space algorithms are also considered. The quality of the output is usually measured using conventional approximation ratio measures [97], although in some problems the notion of approximation and the manner of evaluating the results remain to be further investigated.
These performance constraints call for designing novel techniques and novel
computational paradigms. Since the amount of data far exceeds the amount
of workspace available to the algorithm, it is not possible for the algorithm
to “remember” large amounts of past data. A recent approach is to create a
summary of the past data to store in main memory, leaving also enough memory for the processing of the future data. Using a random sample of the data is another popular technique.
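One standard way to maintain such a random sample in a single pass and bounded memory is reservoir sampling. The sketch below is a generic illustration of that idea under the constraints just described, not necessarily the sampling scheme used by the algorithms surveyed later.

```python
# Reservoir sampling: keep a uniform random sample of k items from a stream
# in one pass and O(k) memory. Stream and sample size below are arbitrary.
import random

def reservoir_sample(stream, k, seed=0):
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)            # fill the reservoir first
        else:
            j = rng.randrange(i + 1)       # keep item with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), k=5))
```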
Besides data mining, other applications can also be modeled as one-pass problems, such as the interface between the storage manager and the application layer of a database system, or processing data that are brought to the desktop from networks, where each pass essentially is another expensive access to the network. Several communities have contributed (with technical tools and methods as well as by solving similar problems) to the evolution of the data mining field, including statistics, machine learning and databases.
Many single pass algorithms have been developed recently, and also techniques
and tools that facilitate them. We will review some of them here. In the first
part of this chapter (next two sections), we review formalisms and technical
tools used to find solutions to problems in this area. In the rest of the chapter
we briefly discuss recent research in association rules, clustering and web mining.
An association rule relates two columns of the entry matrix (e.g., if the i-th entry
of a row v is 1 then most probably the j-th entry of v is also 1). Clustering the
rows of the matrix according to various similarity criteria in a single pass is
a new challenge which traditional clustering algorithms did not have. In web
mining, one problem of interest in search engines is to rank the pages of the
web according to their importance on a topic. Citation importance is used by popular search engines, according to which important pages are assumed to be those that are linked to by other important pages.
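This recursive notion of importance can be illustrated by a small power-iteration computation in the style of PageRank; the link graph, damping factor and iteration count below are hypothetical choices for illustration only and are not taken from the chapter.

```python
# Power iteration on a tiny made-up link graph: a page's score is the damped
# sum of the scores of the pages that link to it, iterated to a fixpoint.
links = {0: [1, 2], 1: [2], 2: [0]}     # hypothetical web graph: page -> outlinks
n, damping = 3, 0.85

rank = [1.0 / n] * n
for _ in range(50):                     # iterate until (approximately) stable
    new = [(1 - damping) / n] * n
    for page, outs in links.items():
        for target in outs:
            new[target] += damping * rank[page] / len(outs)
    rank = new

print(rank)                             # pages linked to by important pages rank high
```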
In more detail, the rest of the chapter is organized as follows. The next section contains formal techniques used for single pass algorithms and a formalism for the data stream model. Section 3 contains an algorithm with performance guarantees for approximating the Lp distance between two data streams, as an example. Section 4 contains a list of what are considered the main data mining tasks and another list with applications of these tasks. The last three sections discuss recent algorithms developed for finding association rules, clustering a set of data items and searching the web for useful information. In these three sections, techniques mentioned in the beginning of the chapter (such as SVD and sampling) are used to solve the specific problems. Naturally some of the techniques are common; for example, spectral methods are used in both clustering and web mining. As the area is rapidly evolving, this chapter serves as a brief introduction to the most popular technical tools and applications.
2 Formal Techniques and Tools
In this section we present some theoretical results and formalisms that are often
used in developing algorithms for data mining applications. In this context, the
singular value decomposition (SVD) of a matrix (Subsection 2.1) has inspired web search techniques and, as a dimensionality reduction technique, is used for finding similarities among documents or clustering documents (known as the latent semantic indexing technique for document analysis). Random projections (Subsection 2.1) offer another means for dimensionality reduction explored in recent work. Data streams (Subsection 2.2) are proposed for modeling limited pass algorithms; in this subsection some discussion is given of lower and upper bounds on the required workspace. Sampling techniques (Subsection 2.3) have also been used in statistics and learning theory, under a somewhat different perspective however. Storing a sample of the data that fits in main memory and running a "conventional" algorithm on this sample is often used as the first stage of various data mining algorithms. We present a computational model for probabilistic sampling algorithms that compute approximate solutions. This model is based on the decision tree model [27] and relates the query complexity to the size of the sample.
We start by providing some (mostly) textbook definitions for self-containment purposes. In data mining we are interested in vectors and their relationships under several distance measures. For two vectors $\vec{v} = (v_1,\ldots,v_n)$ and $\vec{u} = (u_1,\ldots,u_n)$, the dot product or inner product is defined to be the number equal to the sum of the component-wise products, $\vec{v}\cdot\vec{u} = v_1u_1 + \ldots + v_nu_n$, and the $L_p$ distance (or $L_p$ norm) is defined to be $\|\vec{v}-\vec{u}\|_p = (\Sigma_{i=1}^{n} |v_i-u_i|^p)^{1/p}$. For $p=\infty$, the $L_\infty$ distance is equal to $\max_{i=1}^{n} |u_i-v_i|$. The $L_p$ distance is extended to be defined between matrices: $\|V-U\|_p = (\Sigma_i(\Sigma_j |V_{ij}-U_{ij}|^p))^{1/p}$. We sometimes use $\|\cdot\|$ to denote $\|\cdot\|_2$. The cosine distance is defined to be $1 - \frac{\vec{v}\cdot\vec{u}}{\|\vec{v}\|\,\|\vec{u}\|}$. For sparse matrices the cosine distance is a suitable similarity measure, as the dot product deals only with non-zero entries (which are the entries that contain the information) and then it is normalized over the lengths of the vectors.
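As a quick illustration of these definitions (added here, not part of the text), the snippet below computes the dot product, the $L_p$ distance for a few values of $p$, and the cosine distance of two small 0,1 vectors.

```python
# Dot product, L_p distance, and cosine distance as defined above; the two
# example vectors are arbitrary 0/1 rows.
import math

def dot(v, u):
    return sum(vi * ui for vi, ui in zip(v, u))

def lp_distance(v, u, p=2.0):
    if p == math.inf:                      # L_infinity: largest coordinate gap
        return max(abs(vi - ui) for vi, ui in zip(v, u))
    return sum(abs(vi - ui) ** p for vi, ui in zip(v, u)) ** (1.0 / p)

def cosine_distance(v, u):
    # 1 - (v . u) / (||v|| ||u||); only non-zero entries contribute to v . u
    norm_v = lp_distance(v, [0] * len(v))
    norm_u = lp_distance(u, [0] * len(u))
    return 1.0 - dot(v, u) / (norm_v * norm_u)

v, u = [1, 0, 1, 1], [1, 1, 0, 1]
print(lp_distance(v, u, 1), lp_distance(v, u, 2), cosine_distance(v, u))
```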
Some results are based on stable distributions [85]. A distribution $\mathcal{D}$ over the reals is called $p$-stable if for any $n$ real numbers $a_1,\ldots,a_n$ and independent, identically distributed (with distribution $\mathcal{D}$) variables $X_1,\ldots,X_n$, the random variable $\Sigma_i a_iX_i$ has the same distribution as the variable $(\Sigma_i |a_i|^p)^{1/p}X$, where $X$ is a random variable with the same distribution as the variables $X_1,\ldots,X_n$. It is known that stable distributions exist for any $p \in (0,2]$. A Cauchy distribution, defined by the density function $\frac{1}{\pi(1+x^2)}$, is 1-stable; a Gaussian (normal) distribution, defined by the density function $\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$, is 2-stable.
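The stability property can be checked numerically. The sketch below (an added illustration with arbitrarily chosen coefficients) compares the empirical spread of $\Sigma_i a_i X_i$ for Gaussian $X_i$ (the 2-stable case) with the predicted scale $(\Sigma_i |a_i|^2)^{1/2}$.

```python
# Numerical check of 2-stability: for Gaussian X_i, sum_i a_i X_i is distributed
# like (sum_i |a_i|^2)^(1/2) * X. We compare empirical standard deviations.
import random

a = [3.0, -1.0, 2.0]                             # arbitrary coefficients
scale = sum(abs(ai) ** 2 for ai in a) ** 0.5     # (sum |a_i|^p)^(1/p) with p = 2

samples = [sum(ai * random.gauss(0, 1) for ai in a) for _ in range(100_000)]
mean = sum(samples) / len(samples)
std = (sum((s - mean) ** 2 for s in samples) / len(samples)) ** 0.5

print(std, "vs", scale)   # both should be close to sqrt(14), about 3.74
```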
A randomized algorithm [81] is an algorithm that flips coins, i.e., it uses random bits, while no probabilistic assumption is made on the distribution of the input. A randomized algorithm is called Las Vegas if it gives the correct answer on all inputs. Its running time or workspace could be a random variable depending on the random variable of the coin tosses. A randomized algorithm is called Monte Carlo with error probability $\epsilon$ if on every input it gives the right answer with probability at least $1-\epsilon$.
2.1 Dimensionality Reduction
Given a set S of points in a multidimensional space, dimensionality reduction techniques are used to map S to a set S' of points in a space of much smaller dimensionality while approximately preserving important properties of the points in S. Usually we want to preserve distances. Dimensionality reduction techniques can be useful in many problems where distance computations and comparisons are needed. In high dimensions distance computations are very slow, and moreover it is known that, in this case, the distance between almost all pairs of points is the same with high probability and almost all pairs of points are orthogonal (known as the Curse of Dimensionality).
Dimensionality reduction techniques that have recently become popular include Random Projections and Singular Value Decomposition (SVD). Other dimensionality reduction techniques use linear transformations such as the Discrete Cosine Transform, Haar Wavelet coefficients, or the Discrete Fourier Transform (DFT). DFT is a heuristic which is based on the observation that, for many sequences, most of the energy of the signal is concentrated in the first few components of the DFT. The $L_2$ distance is preserved exactly under the DFT, and its implementation is also practically efficient due to an $O(n\log n)$ DFT algorithm.
Dimensionality reduction techniques are well explored in databases [51,43].
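The exact preservation of the $L_2$ distance under an orthonormal DFT, and the use of only the first few coefficients as a cheap approximation, can be seen in the following small sketch (an added illustration; the vector length and the number of retained coefficients are arbitrary).

```python
# Orthonormal DFT preserves the L2 distance exactly (Parseval); keeping only a
# prefix of the coefficients gives a lower bound that is often close when the
# energy concentrates in the first components.
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(256), rng.standard_normal(256)

X = np.fft.fft(x, norm="ortho")
Y = np.fft.fft(y, norm="ortho")

print(np.linalg.norm(x - y))            # original L2 distance
print(np.linalg.norm(X - Y))            # identical up to rounding
print(np.linalg.norm((X - Y)[:16]))     # lower bound from 16 coefficients only
```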
Random Projections. Random Projection techniques are based on the Johnson-Lindenstrauss (JL) lemma [67], which states that any set of $n$ points can be embedded into the $k$-dimensional space with $k = O(\log n/\epsilon^2)$ so that the distances are preserved within a factor of $\epsilon$.

Lemma 1. (JL) Let $\vec{v}_1,\ldots,\vec{v}_m$ be a sequence of points in the $d$-dimensional space over the reals and let $\epsilon, F \in (0,1]$. Then there exists a linear mapping $f$ from the points of the $d$-dimensional space into the points of the $k$-dimensional space, where $k = O(\log(1/F)/\epsilon^2)$, such that the number of vectors which approximately preserve their length is at least $(1-F)m$. We say that a vector $\vec{v}_i$ approximately preserves its length if:
$$(1-\epsilon)\|\vec{v}_i\|^2 \le \|f(\vec{v}_i)\|^2 \le (1+\epsilon)\|\vec{v}_i\|^2$$
The proof of the lemma, however, is non-constructive: it shows that a random mapping induces small distortions with high probability. Several versions of the proof exist in the literature. We sketch the proof from [65]. Since the mapping is linear, we can assume without loss of generality that the $\vec{v}_i$'s are unit vectors. The linear mapping $f$ is given by a $k \times d$ matrix $A$ and $f(\vec{v}_i) = A\vec{v}_i$, $i = 1,\ldots,m$. By choosing the matrix $A$ at random such that each of its coordinates is chosen independently from $N(0,1)$, each coordinate of $f(\vec{v}_i)$ is also distributed according to $N(0,1)$ (this is a consequence of the spherical symmetry of the normal distribution). Therefore, for any vector $\vec{v}$, for each $j = 1,\ldots,k/2$, the sum of squares of consecutive coordinates $Y_j = |f(\vec{v})_{2j-1}|^2 + |f(\vec{v})_{2j}|^2$ has exponential distribution with exponent $1/2$. The expectation of $L = \|f(\vec{v})\|^2$ is equal to $\Sigma_j E[Y_j] = k$. It can be shown that the value of $L$ lies within $\epsilon$ of its mean with probability $1-F$. Thus the expected number of vectors whose length is approximately preserved is $(1-F)m$.
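The following small experiment (an added illustration, with arbitrarily chosen dimensions) mirrors the construction in the proof: it projects unit vectors with a random Gaussian matrix $A$, rescales by the expected squared length $k$, and counts how many vectors have their length preserved within $\epsilon$.

```python
# Random projection with a k x d Gaussian matrix, as in the proof sketch above.
# Since E[||A v||^2] = k for unit v, we divide by k and check concentration.
import numpy as np

rng = np.random.default_rng(1)
d, k, m, eps = 1000, 200, 500, 0.2       # arbitrary illustrative parameters

V = rng.standard_normal((m, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)       # unit vectors v_1..v_m

A = rng.standard_normal((k, d))                      # random linear map f(v) = Av
lengths = np.linalg.norm(V @ A.T, axis=1) ** 2 / k   # rescaled squared lengths

preserved = np.mean((lengths >= 1 - eps) & (lengths <= 1 + eps))
print(f"fraction of vectors with length preserved within {eps}: {preserved:.3f}")
```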
The JL lemma has been proven useful in substantially improving many approximation algorithms (e.g., [65,17]). Recently in [40], a deterministic algorithm