Studies in Theoretical and Applied Statistics
Selected Papers of the Statistical Societies
For further volumes:
http://www.springer.com/series/10104
Series Editors
Spanish Society of Statistics and Operations Research (SEIO)
Ignacio García Jurado
Société Française de Statistique (SFdS)
Avner Bar-Hen
Società Italiana di Statistica (SIS)
Maurizio Vichi
Sociedade Portuguesa de Estatística (SPE)
Carlos Braumann
Agostino Di Ciaccio
Mauro Coli
José Miguel Angulo Ibáñez
Editors
Advanced Statistical Methods
for the Analysis of Large
Data-Sets
Editors
Agostino Di Ciaccio
University of Roma “La Sapienza”
Dept. of Statistics
P.le Aldo Moro 5
00185 Roma
Italy
agostino.diciaccio@uniroma1.it

Mauro Coli
Dept. of Economics
University “G. d’Annunzio”, Chieti-Pescara
V.le Pindaro 42
Pescara
Italy
coli@unich.it

José Miguel Angulo Ibáñez
Departamento de Estadística e Investigación Operativa, Universidad de Granada
Campus de Fuentenueva s/n
18071 Granada
Spain
jmangulo@ugr.es
This volume has been published thanks to the contribution of ISTAT - Istituto Nazionale di
Statistica
ISBN 978-3-642-21036-5 e-ISBN 978-3-642-21037-2
DOI 10.1007/978-3-642-21037-2
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2012932299
© Springer-Verlag Berlin Heidelberg 2012
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from Springer. Violations
are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not
imply, even in the absence of a specific statement, that such names are exempt from the relevant protective
laws and regulations and therefore free for general use.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Editorial
Dear reader, on behalf of the four Scientific Statistical Societies: SEIO, Sociedad de
Estadística e Investigación Operativa (Spanish Statistical Society and Operations
Research); SFdS, Société Française de Statistique (French Statistical Society);
SIS, Società Italiana di Statistica (Italian Statistical Society); SPE, Sociedade
Portuguesa de Estatística (Portuguese Statistical Society), we inform you that
this is a new book series of Springer entitled Studies in Theoretical and Applied
Statistics, with two lines of books published in the series: “Advanced Studies” and
“Selected Papers of the Statistical Societies.” The first line of books offers constant
up-to-date information on the most recent developments and methods in the fields
of Theoretical Statistics, Applied Statistics, and Demography. Books in this series
are solicited in constant cooperation among Statistical Societies and need to show a
high-level authorship formed by a team preferably from different groups to integrate
different research points of view.
The second line of books proposes a fully peer-reviewed selection of papers
on specific relevant topics organized by editors, also on the occasion of conferences,
to show their research directions and developments in important topics, quickly
and informally, but with high quality. The explicit aim is to summarize and
communicate current knowledge in an accessible way. This line of books will not
include proceedings of conferences and aims to become a premier communication
medium in the scientific statistical community by obtaining an impact factor, as
is the case for other book series such as, for example, “Lecture Notes in Mathematics.”
The volumes of Selected Papers of the Statistical Societies will cover a broad
scope of theoretical, methodological as well as application-oriented articles,
surveys, and discussions. A major purpose is to show the intimate interplay between
various, seemingly unrelated domains and to foster the cooperation among scientists
in different fields by offering well-based and innovative solutions to urgent problems
of practice.
On behalf of the founding statistical societies, I wish to thank Springer, Heidelberg
and in particular Dr. Martina Bihn for the help and constant cooperation in the
organization of this new and innovative book series.
Maurizio Vichi
Preface
Many research studies in the social and economic fields involve the collection
and analysis of large amounts of data. These data sets vary in their nature and
complexity, they may be one-off or repeated, and they may be hierarchical, spatial,
or temporal. Examples include textual data, transaction-based data, medical data,
and financial time series.
Today most companies use IT to support all automated business functions, so
thousands of billions of digital interactions and transactions are created and carried
out by various networks daily. Some of these data are stored in databases; most
end up in log files that are discarded on a regular basis, losing valuable information
that is potentially important but often hard to analyze. The difficulties could be due
to the data size, for example thousands of variables and millions of units, but also to
the assumptions about the generation process of the data, the randomness of the
sampling plan, the data quality, and so on. Such studies are subject to the problem of
missing data when enrolled subjects do not have data recorded for all variables of
interest. More specific problems may relate, for example, to the merging of
administrative data or the analysis of a large number of textual documents.
Standard statistical techniques are usually not well suited to manage this type
of data, and many authors have proposed extensions of classical techniques or
completely new methods. The huge size of these data sets and their complexity
require new strategies of analysis sometimes subsumed under the terms “data
mining” or “predictive analytics.” The inference uses frequentist, likelihood, or
Bayesian paradigms and may utilize shrinkage and other forms of regularization.
The statistical models are multivariate and are mainly evaluated by their capability
to predict future outcomes.
This volume contains a peer-reviewed selection of papers, whose preliminary
versions were presented at the meeting of the Italian Statistical Society (SIS), held
23–25 September 2009 in Pescara, Italy.
The theme of the meeting was “Statistical Methods for the analysis of large data-sets,”
a topic that is gaining increasing interest from the scientific community.
The meeting was the occasion that brought together a large number of scientists
and experts, especially from Italy and other European countries, with 156 papers and a
large number of participants. It was a highly appreciated opportunity for discussion
and mutual knowledge exchange.
This volume is structured in 11 chapters according to the following macro topics:
• Clustering large data sets
• Statistics in medicine
• Integrating administrative data
• Outliers and missing data
• Time series analysis
• Environmental statistics
• Probability and density estimation
• Application in economics
• WEB and text mining
• Advances on surveys
• Multivariate analysis
In each chapter, we included only three to four papers, selected after a careful review
process carried out after the conference, thanks to the valuable work of a good
number of referees. Selecting only a few representative papers from the interesting
program proved to be a particularly daunting task.
We wish to thank the referees who carefully reviewed the papers.
Finally, we would like to thank Dr. M. Bihn and A. Blanck from Springer-Verlag
for the excellent cooperation in publishing this volume.
It is worth noting the wide range of different topics included in the selected
papers, which underlines the large impact of the theme “statistical methods for the
analysis of large data sets” on the scientific community. This book aims to offer
new ideas, methods, and original applications to deal with the complexity and high
dimensionality of data.
Sapienza Università di Roma, Italy Agostino Di Ciaccio
Università G. d’Annunzio, Pescara, Italy Mauro Coli
Universidad de Granada, Spain José Miguel Angulo Ibáñez
Contents
Part I Clustering Large Data-Sets
Clustering Large Data Set: An Applied Comparative Study
Laura Bocci and Isabella Mingo
Clustering in Feature Space for Interesting Pattern Identification of Categorical Data
Marina Marino, Francesco Palumbo and Cristina Tortora
Clustering Geostatistical Functional Data
Elvira Romano and Rosanna Verde
Joint Clustering and Alignment of Functional Data: An Application to Vascular Geometries
Laura M. Sangalli, Piercesare Secchi, Simone Vantini, and Valeria Vitelli
Part II Statistics in Medicine
Bayesian Methods for Time Course Microarray Analysis: From Genes’ Detection to Clustering
Claudia Angelini, Daniela De Canditiis, and Marianna Pensky
Longitudinal Analysis of Gene Expression Profiles Using Functional Mixed-Effects Models
Maurice Berk, Cheryl Hemingway, Michael Levin, and Giovanni Montana
A Permutation Solution to Compare Two Hepatocellular Carcinoma Markers
Agata Zirilli and Angela Alibrandi
Part III Integrating Administrative Data
Statistical Perspective on Blocking Methods When Linking Large Data-sets
Nicoletta Cibella and Tiziana Tuoto
Integrating Households Income Microdata in the Estimate of the Italian GDP
Alessandra Coli and Francesca Tartamella
The Employment Consequences of Globalization: Linking Data on Employers and Employees in the Netherlands
Fabienne Fortanier, Marjolein Korvorst, and Martin Luppes
Applications of Bayesian Networks in Official Statistics
Paola Vicard and Mauro Scanu
Part IV Outliers and Missing Data
A Correlated Random Effects Model for Longitudinal Data with Non-ignorable Drop-Out: An Application to University Student Performance
Filippo Belloc, Antonello Maruotti, and Lea Petrella
Risk Analysis Approaches to Rank Outliers in Trade Data
Vytis Kopustinskas and Spyros Arsenis
Problems and Challenges in the Analysis of Complex Data: Static and Dynamic Approaches
Marco Riani, Anthony Atkinson and Andrea Cerioli
Ensemble Support Vector Regression: A New Non-parametric Approach for Multiple Imputation
Daria Scacciatelli
Part V Time Series Analysis
On the Use of PLS Regression for Forecasting Large Sets of Cointegrated Time Series
Gianluca Cubadda and Barbara Guardabascio
Large-Scale Portfolio Optimisation with Heuristics
Manfred Gilli and Enrico Schumann
Detecting Short-Term Cycles in Complex Time Series Databases
F. Giordano, M.L. Parrella and M. Restaino