Visual Analytics of Image-Centric Cohort Studies in Epidemiology

Bernhard Preim, Paul Klemm, Helwig Hauser, Katrin Hegenscheid, Steffen Oeltze, Klaus Toennies, and Henry Völzke

arXiv:1501.04009v1 [cs.CV] 15 Jan 2015

Abstract Epidemiology characterizes the influence of causes on disease and health conditions of defined populations. Cohort studies are population-based studies that usually involve large numbers of randomly selected individuals and comprise numerous attributes, ranging from self-reported interview data to results from various medical examinations, e.g., blood and urine samples. Recently, medical imaging has been used as an additional instrument to assess risk factors and potential prognostic information. In this chapter, we discuss such studies and how their evaluation may benefit from visual analytics. Cluster analysis to define groups, reliable image analysis of organs in medical imaging data, and shape space exploration to characterize anatomical shapes are among the visual analytics tools that may enable epidemiologists to fully exploit the potential of their huge and complex data. To gain acceptance, visual analytics tools need to complement more classical epidemiologic tools, primarily hypothesis-driven statistical analysis.

1 Introduction

Epidemiology is a scientific discipline that provides reliable knowledge for clinical medicine, focusing on prevention, diagnosis and treatment of diseases [14]. Research in epidemiology aims at characterizing risk factors for the outbreak of diseases and at evaluating the efficiency of certain treatment strategies, e.g., to compare a new treatment with an established gold standard. This research is strongly hypothesis-driven, and statistical analysis has been the major tool for epidemiologists so far. Correlations between genetic factors, environmental factors, lifestyle-related parameters, age and diseases are analyzed.
The data are acquired by a mixture of interviews (self-reported data, e.g., about nutrition and previous infections) and clinical examinations, such as measurement of blood pressure. Statistical correlations, even if they are strong, may be misleading because they do not represent causal relations. As an example, the slightly reduced risk of myocardial infarction and cardiac mortality for elderly people who report drinking one glass of wine every evening (compared to people drinking no alcohol at all) may be due to the low level of alcohol involved but may also be a consequence of a very regular and stress-free lifestyle [14]. When something happens before an event, this is an indicator of a causal relationship. However, care is necessary, since many things happen in the life of an individual before, e.g., a heart attack, but do not cause it.

Thus, statistical correlations are the starting point for investigating why certain factors increase the risk of getting diseases. Epidemiology is not a purely academic endeavor but has huge consequences for establishing and evaluating preventive measures even outside of medicine. The protection of people from passive smoking, recommendations for various vaccinations and the introduction of early cancer detection strategies, e.g., mammography screening, are all based on large-scale epidemiological studies. The official guidelines for the treatment of widespread diseases, such as diabetes, are also based on evidence from epidemiological studies [14]. While all this may sound obvious, it is a rather recent development. Evidence-based medicine often still has to "fight" against recommendations of a few opinion leaders who argue based on their personal experience only.

Bernhard Preim, Otto-von-Guericke University Magdeburg, 39106 Magdeburg, e-mail: [email protected]
The analysis techniques used so far are limited to investigating hypotheses based on known or suspected relations, e.g., hypotheses related to observations or previous publications. The available tools support the analysis of a few dimensions, but not of the hundreds of attributes acquired per individual in a cohort study. Typical visualization techniques as well as analysis techniques, e.g., support vector machines, do not scale well to hundreds of attributes [41]. While we are not able to describe solutions for these challenging problems, we give a survey of recent approaches that also aim at hypothesis generation.

Organization. This chapter is organized as follows. In Sect. 2, we describe important concepts and terms of epidemiology, including observations from epidemiologic workflows. This discussion is restricted to those terms that are crucial for communicating with epidemiologists, understanding requirements and designing solutions that fit into their process. In Sect. 3, we discuss how (general) information visualization and data analysis techniques may be used for epidemiologic data. Section 4 describes the analysis of image data from cohort studies and how this analysis is combined with the exploration of non-image attribute data. This section represents the core of the chapter and employs a case study where MRI data of the lumbar spine are analyzed along with attributes characterizing lifestyle, working habits, and back pain history.

2 Background in Epidemiology

Population-based studies. Epidemiological studies are based on a sample of the population. The reliability of the results depends on the size of that sample but also strongly on the selection criteria. Often, data from patients treated in one hospital are analyzed. While this may be a large number of patients, the selection may be heavily biased, e.g., because the hospital is highly specialized and the diseases of its patients are often more severe or at a later stage than in the general population.
Population-based studies, where representative portions of a population (without known diseases) are examined, have the potential to yield highly reliable results. The source population may be from a city, a region or a country. Individuals are randomly selected, e.g., by drawing on the databases of population registries. The higher the percentage of people who accept the invitation and actually take part in the study, the more reliable the results are.

In this chapter, we focus on longitudinal population-based studies. The sheer amount and diversity of the data make it difficult to fully identify and analyze interesting relations. We will show that information visualization and visual analytics techniques may provide substantial support that complements the statistical tools with their rather simple statistical graphics. Most epidemiological studies were restricted to nominal (often called categorical) and scalar data, e.g., related to alcohol consumption and body mass index as one measure of obesity.

Image-centric epidemiological studies. More recently, for example in the Rotterdam study [22], non-invasive imaging data, primarily ultrasound and MRI data, have also been employed. Petersen and colleagues [32] report on six studies involving cardiac MRI from at least 1000 individuals in population-based studies. These high-dimensional data make it possible to answer analysis questions such as: how does the shape of the spine change as a consequence of age, lifestyle and diseases? We focus on such image-centric epidemiological studies.

Epidemiology and public health. There are different branches of epidemiology. One branch deals with predictions to inform public health activities. These include measures in case of an epidemic, an acute public health problem that is mostly related to infectious diseases.
The recent article "Computational epidemiology" [29] focused on this branch of epidemiology. Another branch of epidemiology aims at long-term studies and at findings that are primarily essential for prevention. Image-centric cohort studies, the focus of this chapter, belong to this second branch. The target user group consists of epidemiologists who can be expected to have a high level of expertise in statistics. Thus, their findings involve statistical significance, confidence intervals and other measures of statistical power.

Healthy aging and pathologic changes. An essential problem in the daily clinical routine is the discrimination between healthy age-related modifications (that may not be reversed by treatment) and early-stage diseases (that may benefit from immediate treatment). As a consequence, elderly people are often not adequately treated. A general goal of epidemiological studies is therefore to find better and more reliable markers for early-stage diseases. The cardiovascular branch of the Rotterdam study, for example, aims at an understanding of atherosclerosis, coronary heart disease and "cardiovascular conditions at older age" [22].

Modern epidemiology. Epidemiology faces new challenges due to the rapid progress, e.g., in genetics and sequencing technology as well as medical imaging. Acquisition of health data thus becomes cheaper and more precise. In cohort studies, as much potentially relevant data as possible are acquired as a basis for as broad a spectrum of analysis questions as possible. This includes blood, urine and tissue samples, information about environmental conditions and the social milieu.

Visual analytics for modern epidemiology. In the past, epidemiology primarily dealt with hypotheses and aimed at proving them, e.g., the efficiency of early cancer detection programs in terms of mortality and long-term survival [14]. Recently, more and more data mining has been performed to identify correlations.
Results of such analyses, however, need to be interpreted very carefully. If thousands of potential correlations are analyzed automatically, some of them will reach a high level of statistical significance just by chance.

An essential support for epidemiology research is to define relevant subgroups. Performing separate analyses for women and men as well as for different age groups is common practice in epidemiology. However, relevant subgroups may be defined by a non-obvious combination of several attributes, which may be detected by a combination of cluster analysis and appropriate visualization.

Since the information space is growing with each examination cycle, Pearce and Merletti [31] pointed out in 2006 that methods are needed which can cope with this complexity and enable the analysis of underlying causes of a certain disease. Visual analytics (VA) methods can support epidemiological data assessment in different ways, e.g., by defining subgroups based on a multitude of attributes that exhibit a certain characteristic. For the analysis of scalar and categorical data, established information visualization techniques combined with clustering and dimension reduction are a good starting point, but need to be tightly integrated with the statistical tools that epidemiologists are more familiar with. For image-centric studies, however, new visualization, (image) analysis and interaction techniques are needed.

In the following, we define essential terms in epidemiology and give an overview of cohort studies that employ medical image data as an essential element. Finally, we describe how image data, derived information and other data complement each other to identify and characterize risks.

2.1 Important Terms

Prevalence and incidence.
Epidemiology investigates how often certain diseases or clinical events, such as a cerebral stroke or sudden cardiac death, occur in the population. Two terms are important to characterize this frequency. The prevalence indicates the portion of people suffering from a disease at a given point in time. The incidence represents how many people newly suffer from a disease or event in a certain interval, usually one year. High prevalence is usually associated with high economic costs. Population-based studies focus on diseases with a high prevalence, such as diabetes, coronary heart disease or neurodegenerative diseases. Even these diseases do not occur frequently in a random population that includes many younger people (where the prevalence of these diseases is low). A rare disease, such as amyotrophic lateral sclerosis, may have a prevalence of 5 in 100,000. Thus, even in a large population-based study, probably no individual suffers from this disease.

Absolute and relative risks. Another essential epidemiological term is the risk for a clinical event, such as the outbreak of a certain disease, its severity (stage) or death. As an example, a study related to cardiac risk may investigate angina pectoris, myocardial infarction and atrial fibrillation depending on attributes such as age and sex. The absolute risk characterizes the likelihood of getting a disease over a lifetime. The absolute risk for a woman to develop breast cancer in the Western world is particularly high for women aged 50-60 (2.6%) and 60-70 (3.7%). Therefore, mammography screening, which aims at early detection and thus optimal treatment, was introduced for these age groups.

The relative risk (RR) characterizes the increased risk if an individual is exposed to a certain risk factor, e.g., smoking, excessive weight, or alcohol abuse. It is based on a comparison with a control group not exposed to that risk factor. A value of RR < 1 represents a protective factor, e.g., moderate physical activity.
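The terms above can be illustrated with a small calculation. The following Python sketch is not taken from the studies discussed here: the cohort size of 4,000, the case counts and all function names are hypothetical, and individuals are assumed to be independent. It estimates how many cases of a rare disease a cohort can be expected to contain and computes RR for one risk factor and one protective factor.

```python
def prevalence(cases, population):
    """Portion of a population suffering from a disease at a given point in time."""
    return cases / population


def probability_of_no_case(prev, cohort_size):
    """Chance that a random cohort contains no affected individual at all
    (assumes independent individuals, a simplification)."""
    return (1.0 - prev) ** cohort_size


def relative_risk(exposed_cases, exposed_total, control_cases, control_total):
    """Risk in the exposed group divided by the risk in the unexposed control group."""
    risk_exposed = exposed_cases / exposed_total
    risk_control = control_cases / control_total
    return risk_exposed / risk_control


# Rare disease with a prevalence of 5 in 100,000 (value from the text),
# examined in a hypothetical cohort of 4,000 individuals.
rare_prevalence = prevalence(5, 100_000)
expected_cases = rare_prevalence * 4_000
p_no_case = probability_of_no_case(rare_prevalence, 4_000)

# Hypothetical counts: 30 of 1,000 exposed vs. 15 of 1,000 unexposed individuals fall ill.
rr_risk_factor = relative_risk(30, 1_000, 15, 1_000)
# Hypothetical protective factor: 12 of 1,000 exposed vs. 20 of 1,000 unexposed.
rr_protective = relative_risk(12, 1_000, 20, 1_000)

print(f"expected cases: {expected_cases:.2f}, P(no case): {p_no_case:.2f}")
print(f"RR (risk factor): {rr_risk_factor:.2f}, RR (protective): {rr_protective:.2f}")
```

With these made-up numbers, the expected number of cases stays well below one, which matches the remark that a rare disease will probably not occur in the cohort at all, and the second RR falls below 1 as expected for a protective factor.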
Exciting observations often concern the combined effects of several parameters. A certain factor may be protective for some people (younger, slim women) and associated with an increased risk for others. The combined risk may be significantly smaller or larger than could be expected from the individual factors.

Moreover, relationships are often distinctly non-linear or even non-monotonic. Dose-response relationships are often non-linear: RR increases slowly (almost no effect for a small dose) and increases much faster for higher levels of a dose, e.g., exposure to a toxic substance. A typical non-monotonic relation is U-shaped, that is, both very low and very high values of an attribute involve an increased risk, whereas values in between are associated with a reduced risk. Examples are weight (both very low and very high weight are associated with an increased risk of mortality) and sleeping time (both very short and very long sleepers have an increased risk of developing psychiatric disorders [22]). Such relations cannot be characterized by a global RR value. Instead, tools are necessary that support the hypothesis of a U-shaped relation by estimating its parameters with some kind of best-fit algorithm.

2.2 Image-Centric Cohort Studies

Image data in epidemiology. The acquisition of image data is determined by the available time, by financial resources, by the epidemiological importance and by ethical considerations. Epidemiological studies require approval by a local ethics committee. As a consequence, healthy individuals in a cohort study should not be exposed to a risk associated with the examinations carried out. Thus, MRI should be preferred over X-ray or CT imaging because it involves no ionizing radiation. Petersen and colleagues [32] explain why cardiac CT is less feasible in a cohort study and why even MR is only used without a contrast agent in their study for ethical reasons. MRI data and ultrasound data are the prevailing modalities in both the SHIP and the Rotterdam study. Unfortunately, MRI and ultrasound data do not exhibit standardized intensity values (in contrast to CT data).
Moreover, MRI and ultrasound data suffer from inhomogeneities and various artifacts. Thus, they are more difficult for humans to interpret and more difficult to analyze with computational means. These data are used to measure, e.g., the thickness of vessel walls, the abdominal aorta diameter and plaque vulnerability in the coronary vessels [22]. The intensive use of MRI in epidemiological research also explains to some extent which questions are analyzed: MRI is the best modality for the analysis of brain structures and thus serves to explore early signs of Parkinson's, Alzheimer's and other neurodegenerative diseases. Epidemiological research aims at identifying such brain pathologies in a pre-symptomatic stage. Among the sources for such investigations are MR Diffusion Tensor Imaging data, which enable an assessment of white matter integrity [22].

The selection of imaging parameters is always a trade-off between conflicting goals related to quality (e.g., image resolution, signal-to-noise ratio), patient comfort (e.g., examination time) and associated costs. As a consequence, to shorten overall examination times in cohort studies, the highest possible quality is not available, e.g., a slice distance of 4 mm is more typical than 1 mm. A great advantage of MRI is that it is very flexible and can display different structures in different sequences, such as T1-, T2- and proton density-weighted imaging. MRI data in cohort studies often comprise more than ten different sequences.

Standardization in image acquisition. Due to the rapid progress in medical imaging, sequences, protocols and even (MR) scanners are frequently updated in clinical routine (similar to the update frequency on a computer).
These updates would severely hamper the comparison of imaging results and thus the assessment of natural changes and disease outbreak. Thus, differences in acquisition parameters are essential confounding variables. Therefore, for one cohort and examination cycle, which may last up to several years, no updates are allowed. Moreover, all involved physicians and radiology technicians are carefully instructed to use the same standardized imaging parameters. This point is even more important for longitudinal studies with repeated imaging examinations. Even if MR scanners and protocols are not updated, the life cycle of MR coils leads to changes in image quality that need to be monitored and compensated.

2.3 Examples for Image-Centric Cohort Study Data

In the following, we describe selected comprehensive and ongoing longitudinal cohort studies. These studies use a number of (epidemiologic) instruments that are innovative in cohort studies and have thus already led to a large number of insights documented in hundreds of (medical) publications. A considerable portion of these publications employ results from imaging data. However, the full potential of analyzing organ shapes, textures and spatial relations quantitatively has not been exploited so far.

The Rotterdam study. A prominent example is the Rotterdam Study¹, initiated in 1990 in the city of Rotterdam in the Netherlands. Similar to later studies, it was motivated by the demographic change, with more and more elderly people suffering from different diseases and their interactions. After the initial study involving almost 8,000 men and women, follow-ups at four points in time were performed; the most recent examinations took place in the 2009-2011 period. In the later examination cycles, new individuals were also involved, leading to datasets from almost 15,000 patients [22].
The original focus of the Rotterdam Study was on neurological diseases, but meanwhile it has been extended to other common diseases, including cardiovascular and metabolic diseases. The study has an enormous impact on epidemiological and related medical research, documented in 797 journal publications registered in the PubMed database (search with keyword "Rotterdam Study", January 30, 2014). Among them are predictions for the future prevalence of heart diseases and many studies on potential risk factors for neurodegenerative diseases. For a comprehensive overview of the findings, see [22], which summarizes the findings of more than 240 papers related to the Rotterdam Study. In a similar way, [23] is a significant update of these findings with more recent data.

Norwegian Aging Study. A long-term study in Norway investigates the relations between brain anatomy (as well as brain function), cognitive function, and genetics in normally aging people.² In total, 170 individuals (120 of them female), aged between 46 and 77 (mean 62), have been examined in Bergen and Oslo in three waves so far (1st wave in 2004/2005, next in 2008/2009, and most recently in 2011/2012) [48]. While naturally not all of these subjects could be followed through all three waves, most of them were subjected to an extensive combination of

1. neuropsychological tests, including tests of the intellectual, language (memory), sensory/motor, and attention/executive function,
2. MRI data, including co-registered T1-weighted anatomical imaging, diffusion tensor imaging, and, from the 2nd wave on, also resting-state functional MRI, as well as
3. genotyping (1st wave only) [46].

¹ http://www.erasmus-epidemiology.nl/research/ergo.htm, accessed: 1/31/2014
² http://org.UiB.no/aldringsprosjektet/, accessed: 1/31/2014

The substantially heterogeneous imaging and test data are used to study aging-related questions about the modern Norwegian population, for example, how anatomical and functional changes in the human brain possibly relate to the later development of Alzheimer's disease and dementia.
Important findings include the relation between hippocampal volumes and memory function in elderly women [48] and the relation between subcortical functional connectivity and verbal episodic memory function in healthy elderly [47].

SHIP. The Study of Health in Pomerania (SHIP) is another cohort study that broadly investigates findings and their potential prognostic value for a wide range of diseases. The SHIP tries to explain health-related differences between East and West Germany after the German reunification. It was initiated in the extreme northeast of Germany, a region with high unemployment and a relatively low life expectancy. In the first examination cycle (1997-2001), 4,308 adults of all age groups were examined, followed by a second and a third cycle, the latter finished at the end of 2012. The instruments used changed over time, with some initial image data (liver and gall bladder ultrasound) available already in the first cycle and others, in particular whole-body MRI, added later. The use of whole-body MRI was unique in 2008 when the third examination cycle started. Breast MRI is performed for women, whereas MR angiography data are acquired for men, since men suffer from cardiovascular diseases significantly earlier than women [42]. In addition, a second cohort (SHIP-Trend) was established, comprising 4,420 adult participants.

Diagnostic reports are created by two independent radiologists who follow strict guidelines to report their findings in a standardized manner. The pilot study to discuss the viability and potential of such a comprehensive MR exam is described in [20]. The overall time for the investigation is two (complete) days, with 90 minutes for the MR exam. The SHIP helped to reliably determine the prevalence of risk factors, such as obesity, and diseases. Major findings of the SHIP are increased levels of obesity and high blood pressure in the cohort (compared to the German population). The MR exams alone identified pathological findings in 35% of the sample population. More than 400 publications in peer-reviewed journals are based on SHIP data (January 2014).
UK Biobank. The UK Biobank started recently and represents a comprehensive approach to study diseases with a high prevalence in an aging society, such as hearing loss, diabetes and lung diseases. Half a million individuals will be investigated in one examination cycle, of whom 100,000 will receive an MRI examination from 2014 onwards. The rationale for the number of individuals to be included is explained by Petersen and colleagues [32]: they aim at a reliable identification of even moderate risk factors (RR between 1.3 and 1.5) for diseases with a prevalence of 5%. The prospective study is planned to have a comprehensive protocol of cardiac MRI, brain MRI and abdominal MRI. This prospective cohort study also involves genetic information.³

³ http://www.ukbiobank.ac.uk, accessed: 1/31/2014

The German National Cohort. The recently started "German National Cohort" in Germany is based on experiences with a number of moderate-size studies, such as SHIP, and examines some 200,000 individuals over a period of 10-20 years. Individuals will be invited in three waves to characterize changes. Due to the large-scale character, imaging is distributed over five cities. Thus, the subtle differences in imaging between different scanners have to be considered.⁴ The study explicitly aims at improvements in the treatment of chronic diseases and involves a variety of tissue samples, e.g., lymphocytes. Imaging in 30,000 individuals is again performed with MRI, comprising whole body, brain and heart.

2.4 Epidemiological Data

Epidemiological data are huge and very heterogeneous. As an example, in the UK Biobank, 329 attributes relate to physical measures, such as pulse rate and systolic and diastolic blood pressure, and various measures relate to vision or hearing. 471 attributes relate to interviews (socio-demographics, health history, lifestyle, ...). The data that are stored per individual are standardized but not completely the same, e.g., childbirth status and menstrual period are available for women only.
Image data and derived information, e.g., segmentation results, have significantly increased both the amount and the complexity of the data. Longitudinal cohort study data are time-dependent. While some instruments, such as blood pressure measurements, are available for all examination cycles, others were added later or removed. Individuals drop out because they move, die or just do not accept the invitation to a second or third examination cycle. It is important to also consider such incomplete data, but to be aware of potentially misleading conclusions.

The great potential of image-centric studies is that image data and associated laboratory data as well as data from interviews are available. An epidemiological study, such as the SHIP, has a large data dictionary that precisely defines all attributes and their ranges. While laboratory data are scalar values, most data from interviews are nominal or ordinal values. In particular, data from interviews exhibit a substantial amount of uncertainty. Self-reports with respect to alcohol and drug use, cigarette smoking and sexual practices may be biased towards "expected or socially accepted" answers. Epidemiologists are not only aware of these problems but have developed strategies to minimize the negative effects, e.g., by asking redundant questions. After data collection, experts spend a lot of effort on improving the quality of the data. Despite these efforts, visual analytics techniques have to consider outliers as well as missing and erroneous data.

Geographic data. Geographic data play a central role in public health where the dynamics of local infections are visualized and analyzed (disease mapping).
Chui and colleagues [9] presented a visual analytics solution directly addressing this problem by combining three dedicated views. In cohort study data, geographic data are also potentially interesting for understanding local differences in the frequency and severity of diseases as an interaction between environmental factors and genetic differences. This branch of epidemiology is referred to as spatial epidemiology. Beale and colleagues [5] investigated differences between rural and urban populations. In their comprehensive survey, Jerrett and colleagues [24] considered spatial epidemiology as an emerging area. However, we do not focus on spatial epidemiology, since cohort study data typically comprise rather narrow regions and thus may not fully support such analysis questions.

⁴ http://www.nationale-kohorte.de/, accessed: 1/31/2014

2.5 Analysis of Epidemiological Workflow

The following discussion of observations and requirements for computer support is largely based on discussions with epidemiologists as well as the inspiring publication by Thew and colleagues. According to [40]

• epidemiological hypotheses are mostly observations made by physicians in clinical routine,
• corresponding attributes are chosen based on the observations and further experience, and
• regression analysis is frequently used to determine whether the investigated attribute is a risk factor or not.

Major requirements for an epidemiological workflow (again based on [40]) are:

• Results have to be reproducible. Due to the iterative data assessment, methods need to be applicable to new datasets as well, and the results need to be comparable between different assessment times to characterize the change. User input needs to be monitored at all times to enable reproducible results.
• A major result of an epidemiological analysis is whether certain factors influence a disease significantly. Relative risk (as a measure of effect size) and p-values (as statistical significance levels) are particularly important.
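The second requirement, reporting an effect size together with its significance, can be sketched for the simplest case of a single binary risk factor. The following Python example is only an illustration under stated assumptions: the 2x2 counts are hypothetical, the p-value comes from a Pearson chi-square test with one degree of freedom and no continuity correction, and the confidence interval uses the common log-normal approximation for RR. In practice, epidemiologists would rely on regression models in a statistics package.

```python
import math


def chi_square_p_value(a, b, c, d):
    """p-value of a Pearson chi-square test on a 2x2 table (1 degree of freedom).

    a: exposed with disease      b: exposed without disease
    c: unexposed with disease    d: unexposed without disease
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def relative_risk_with_ci(a, b, c, d, z=1.96):
    """Relative risk with an approximate 95% confidence interval (log method)."""
    rr = (a / (a + b)) / (c / (c + d))
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical counts: 30 of 1,000 exposed and 15 of 1,000 unexposed individuals fall ill.
rr, ci_lower, ci_upper = relative_risk_with_ci(30, 970, 15, 985)
p_value = chi_square_p_value(30, 970, 15, 985)
print(f"RR = {rr:.2f} (95% CI {ci_lower:.2f} to {ci_upper:.2f}), p = {p_value:.3f}")
```

In this made-up table, the confidence interval excludes 1 and the p-value is below 0.05, so the factor would be reported as a significant risk factor; with smaller groups the same RR could easily fail to reach significance.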
Although these requirements consider neither image data nor visual analytics, they also have to be considered in these more innovative settings. Reproducibility, for example, means that clustering with random initialization is not feasible. Moreover, reports must be generated that clearly reveal all settings, e.g., the parameters of the clustering algorithms that were used for generating the results.

Since statistical analysis plays such an important role, statistics packages, such as SPSS⁵, R⁶ and STATA⁷, dominate in epidemiology. They provide various statistical tests, also for cases where assumptions, such as a normal distribution, are not valid.

⁵ http://www-01.ibm.com/software/analytics/spss/products/statistics/, accessed: 1/31/2014
⁶ http://www.r-project.org/, accessed: 1/31/2014
⁷ http://www.stata.com/, accessed: 1/31/2014
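The reproducibility requirement for clustering discussed above can be met in several ways. The following Python sketch shows one possible reading, not a method described in the chapter: a small k-means implementation whose initialization is controlled by an explicit seed and which returns a report of all settings together with the result. The data points (age, body mass index), the function name and the report fields are hypothetical.

```python
import random


def kmeans_with_report(points, k, seed, max_iter=100):
    """Plain k-means (Lloyd's algorithm) with seeded initialization.

    Returns the cluster centers and a report of all settings, so that the
    same call on the same data can be repeated and documented.
    """
    rng = random.Random(seed)          # explicit seed instead of hidden randomness
    centers = rng.sample(points, k)    # initial centers are drawn from the data
    for iteration in range(1, max_iter + 1):
        # Assign every point to its nearest center (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    report = {
        "algorithm": "k-means (Lloyd)",
        "k": k,
        "seed": seed,
        "max_iter": max_iter,
        "iterations_used": iteration,
    }
    return centers, report


# Hypothetical (age, body mass index) pairs; the attributes are not standardized here.
data = [(30, 22.0), (35, 23.5), (32, 21.0), (65, 29.0), (70, 31.0), (68, 30.5)]
centers_a, report_a = kmeans_with_report(data, k=2, seed=42)
centers_b, report_b = kmeans_with_report(data, k=2, seed=42)
print(report_a)
print(centers_a == centers_b)  # identical settings give an identical result
```

A deterministic initialization scheme would serve the same purpose; the essential point is that every setting that influences the result appears in the generated report.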