Knowledge Discovery from Physical Activity

99 Pages·2017·1.4 MB·English

Susanne Jauhiainen
Knowledge Discovery from Physical Activity
Master's Thesis in Information Technology
May 31, 2017
University of Jyväskylä
Department of Mathematical Information Technology

Author: Susanne Jauhiainen
Contact information: [email protected].fi
Supervisor: Tommi Kärkkäinen and Sami Äyrämö
Title: Knowledge Discovery from Physical Activity
Title in Finnish: Knowledge Discovery from Physical Activity
Project: Master's Thesis
Study line: Computational sciences
Page count: 99+0

Abstract: In this master's thesis the Knowledge Discovery in Databases (KDD) process and its usage with physical activity data are discussed. The KDD process has multiple steps, including preprocessing, transformation, and data mining. Clustering is used as the data mining technique and is introduced in detail. A large set of different Cluster Validation Indices (CVAIs) and their implementations are tested with k-means clustering, and the best-performing ones are further generalized. In the empirical part, physical activity data from Finnish seventh-grade students is assessed following the KDD process, using multiple different transformations with different clustering methods. The aim is to find out whether unsupervised data mining can help detect novel and useful information from this data.

Keywords: Knowledge discovery, physical activity, cluster validation index

Abstract in Finnish (translated): This master's thesis goes through the Knowledge Discovery in Databases (KDD) process and its possible applications to data related to physical activity. The KDD process consists of many different steps, including preprocessing, data transformation, and data mining. In this thesis, clustering is used as the data mining method and is covered in detail. We also compare a large set of different cluster validation indices (CVAIs) and their different implementations with k-means clustering, and present the best of them in a more general form. In the empirical part of the thesis, activity data from seventh-grade school children is studied following the KDD process and using many different data transformations and clustering methods. The aim is to find out whether unsupervised data mining can help find new and useful information in the data.

Keywords in Finnish (translated): Knowledge discovery, physical activity, cluster validation index

Glossary

KDD: Knowledge Discovery from Databases
EDA: Exploratory Data Analysis
LDA: Linear Discriminant Analysis
SVM: Support Vector Machine
CPM: Counts Per Minute
CVAI: Cluster Validation Index
IoT: Internet of Things
QS: Quantified Self
PCA: Principal Component Analysis
MET: Metabolic Equivalent of Task
MVPA: Moderate to Vigorous Physical Activity

List of Figures

Figure 1. The original KDD process. From Fayyad, Piatetsky-Shapiro, and Smyth 1996b.
Figure 2. With a boxplot one can get a quick overview of the data. The upper and lower bounds of the box are the third and first quartile, the line inside the box is the median, the whiskers represent max and min values, and the plus signs stand for outliers. This boxplot is derived from the sedentary data of students in Chapter 6.
Figure 3. Data with two classes; the black circle and square are to be classified.
Figure 4. Fruit classification using a decision tree.
Figure 5. On the right LDA with vector w and on the left SVM, where grey markers are the support vectors and $1/\|w\|$ is the margin. Adapted from Zaki and Meira Jr 2014.
Figure 6. A line fitted to describe the relationship between delivery time and delivery volume. Adapted from Montgomery, Peck, and Vining 2015.
Figure 7. On the left, data drawn from a Gaussian distribution and on the right from a Laplacian distribution. From Äyrämö 2006, p. 56.
Figure 8. Density-based dataset with two non-linearly separable clusters. From https://en.wikipedia.org/wiki/DBSCAN.
Figure 9. The DBSCAN algorithm when minpts=3, where the black circles are core points, white ones are border points, and the black square is noise. The lines indicate a distance less than or equal to ε.
Figure 10. The S-datasets with 15 clusters and increasing noise.
Figure 11. From left to right: S1D2, S2D2, and S5D2.
Figure 12. Average counts per minute for hours of the week from one student.
Figure 13. Available data for the one hour periods.
Figure 14. Available data for the half-an-hour periods.
Figure 15. Cluster prototypes with one hour periods considering entire time. Cluster one is dark blue, cluster two is light blue, and cluster three is red.
Figure 16. Cluster prototypes for the most separating times.
Figure 17. Spatial median for each variable on the x-axis and the $\chi^2$-value of Kruskal-Wallis on the y-axis. The line was manually inserted and variables above it are the most separating.
Figure 18. Cluster prototypes with half hour periods considering school time. Cluster one is dark blue, cluster two is light blue, cluster three is orange, and cluster four is red.
Figure 19. Cluster prototypes for the most separating times.
Figure 20. Spatial median for each variable on the x-axis and the $\chi^2$-value of Kruskal-Wallis on the y-axis. The line was manually inserted and variables above it are the most separating.
Figure 21. Part of the activity counts from a student. CPM less than 100 is considered sedentary time.
Figure 22. CVAIs for transformation 1 that support the number of clusters being 5. On the y-axis the value of the CVAI and on the x-axis the number of clusters, K. From left to right the CVAIs are Silhouette, Wemmert Gançarski, and Davies Bouldin. The proposed number is the minimum index value.
Figure 23. The results from hierarchical prototype-based clustering. First, five clusters were formed, and then the two largest ones were clustered again to find subgroups (Wartiainen and Kärkkäinen 2015; Saarela and Kärkkäinen 2015). The sizes of the clusters are marked on the figure, as well as the indices proposing the number of subclusters found.
Figure 24. CVAIs for transformation 2.3 that support the number of clusters being 10. On the y-axis the value of the CVAI and on the x-axis the number of clusters, K. From left to right the CVAIs are Wemmert Gançarski, Ray Turi, and Davies Bouldin. The proposed number is the minimum index value.
Figure 25. Transformation 1, 10 most separating variables. For example, the most separating variable is a sedentary period of length 65 minutes, mostly present in cluster 11. This variable can be considered to characterize this cluster, since it is the one that most separates it from the others.
Figure 26. Transformation 1, portion of a weekday in a cluster.
Figure 27. Transformation 2.3, 10 most separating variables. For example, the most separating variable is a sedentary period of length 31 minutes, mostly present in cluster 4. This variable can be considered to characterize this cluster, since it is the one that most separates it from the others.
Figure 28. The difference between prototypes and the whole data median. C1 is blue, C2 is green, and C3 is yellow.

List of Tables

Table 1. An example database with five transactions containing five items.
Table 2. Numbers of clusters suggested by the CVAIs.
Table 3. Distribution of the measurement days.
Table 4. Variables chosen for clustering.
Table 5. Metadata for clusters with one hour periods considering entire time. Total activity is the sum of all activity over the week.
Table 6. Metadata for clusters with half-hour periods considering school time. Total activity is the sum of all activity over the week.
Table 7. Example table of sedentary periods of the students.
Table 8. Metadata for clusters obtained with transformation 1.
Table 9. Metadata for clusters obtained with transformation 2.3.
Table 10. Metadata for clusters.

Contents

1 INTRODUCTION
2 THE KDD PROCESS
  2.1 Data matrix
  2.2 Preprocessing
  2.3 Transformation
    2.3.1 Dimension reduction
  2.4 Data mining
    2.4.1 Exploratory data analysis
    2.4.2 Frequent pattern mining
    2.4.3 Classification
    2.4.4 Regression
  2.5 Summary
3 CLUSTERING
  3.1 What is a cluster
  3.2 Hierarchical clustering methods
  3.3 Density-based clustering methods
  3.4 Prototype-based clustering methods
    3.4.1 Cluster initialization
  3.5 Robust clustering
  3.6 Cluster validation
  3.7 Summary
4 MEASURING AND QUANTIFYING PHYSICAL ACTIVITY
  4.1 Accelerometers
    4.1.1 ActiGraph and counts
  4.2 Gyroscopes
  4.3 Pedometers
  4.4 Heart rate monitors
  4.5 Vision sensors
  4.6 Global Positioning System
  4.7 Physical activity data applications
    4.7.1 Human activity recognition
    4.7.2 Gait analysis
    4.7.3 Monitoring and medical diagnosis
    4.7.4 Activity and sports
  4.8 Activity monitors
    4.8.1 Consumer-based physical activity monitors
    4.8.2 Validity of consumer-based physical activity monitors
  4.9 Summary
5 CVAI TESTS
  5.1 Test data
  5.2 Methods
  5.3 Results
  5.4 Conclusion
6 KNOWLEDGE DISCOVERY FROM THE ACTIVITY OF FINNISH SCHOOL CHILDREN
  6.1 Recommendations and the Finnish Schools on the Move program
  6.2 Study population and data collecting
  6.3 Preprocessing
  6.4 Weekly physical activity of the students
    6.4.1 Transformation
    6.4.2 Data mining
    6.4.3 Results
    6.4.4 Entire time with one hour periods
    6.4.5 School time with half-an-hour periods
    6.4.6 Conclusion
  6.5 Sedentary behaviour of the students
    6.5.1 Transformation
    6.5.2 Data mining
    6.5.3 Results
    6.5.4 Transformation 1
    6.5.5 Transformation 2.3
  6.6 Sedentary behaviour of the students 2
    6.6.1 Transformation
    6.6.2 Data Mining
    6.6.3 Results
  6.7 Summary
7 CONCLUSION
BIBLIOGRAPHY

1 Introduction

Physical activity can be defined as "any bodily movement produced by skeletal muscles that requires energy expenditure" (Caspersen, Powell, and Christenson 1985). The amount and form of daily physical activity, over the whole life span, can greatly affect an individual's health and quality of life, which makes assessing it very important. An active lifestyle can, for example, reduce the risk of coronary heart disease, hypertension, diabetes, some cancers, and premature mortality in general (Health and Services 1996).
According to the World Health Organization (WHO), physical inactivity is the fourth biggest global risk for mortality, responsible for 5.5% (over 3 million) of all deaths (WHO 2009).

As the technical abilities to measure the different components of activity and recognize its different forms have advanced, many applications and studies in a variety of domains have emerged. For example in gerontology, monitoring the activity of elderly people can be used to detect if they fall (Mubashir, Shao, and Seed 2013) or if they have chest pain or headache (Khan and Sohn 2011). This nondisruptive monitoring can be done from outside their free-living conditions, which improves the quality of their lives and saves elderly care resources by supporting home care. In sports, the foot-ground contact time and the stride rate/cadence can be analysed when running and utilized in maximizing performance (Morin et al. 2007; Weyand et al. 2000). Heart rate monitoring has also become very common and can give a lot of information about, for example, the intensity of physical activity (Vesterinen 2016).

There are many different components of human activity that can be measured with a variety of devices such as accelerometers or heart rate monitors. For example, the total physical activity, the duration, frequency and intensity of physical activity, energy expenditure, as well as number of steps, speed and distance when walking (Butte, Ekelund, and Westerterp 2012) can be determined from the raw sensory measurements. Furthermore, the locomotive activities (e.g. walking, jogging, running) can be classified into their own categories (Bao and Intille 2004; Kwapisz, Weiss, and Moore 2011).

These components of human activity can be measured objectively with many sensors, which can be divided into external and wearable ones (Lara and Labrador 2013). Firstly, inertial sensors are wearable sensors based on inertia, with accelerometers being the most widely used example (Avci et al. 2010).
Acceleration data can be used to estimate, e.g., the intensity of physical activity over time (Chen and Bassett 2005).

Secondly, sensors for physiological signals can be used in measuring the activity. These are normally wearable sensors. Heart rate monitors, for instance, have developed rapidly in recent years and are today very common in sports and training. They are mainly used to determine the exercise intensity (Achten and Jeukendrup 2003), but heart rate variability can also be used to analyze how well an athlete is adapting to endurance training (Vesterinen et al. 2013) or in the prevention and detection of overtraining (Achten and Jeukendrup 2003). Thirdly, vision sensors are external sensors used, for example, in gait analysis and other machine vision applications (Poppe 2007).

With the increase of different human activity sensors and monitors, massive streams of data about our lives are being created continuously. This has led to the suggestion of the concept Internet of Humans (IoH) (Arbia et al. 2015), in connection with the Internet of Things (IoT). While IoT is the general term referring to things and devices being connected to the internet, IoH can be thought of as humans being connected to the internet by using different monitors and uploading data about themselves to the internet. This personal data can also be referred to as MyData (Poikola and Honko 2010). An emerging trend is to talk about the so-called quantified self (QS), which has been defined "as any individual engaged in the self-tracking of any kind of biological, physical, behavioral, or environmental information" (Swan 2013). As the amount of data grows, new approaches are needed to utilize it more effectively.

Knowledge Discovery in Databases (KDD) is a process of finding valid and novel information from large amounts of data (Fayyad, Piatetsky-Shapiro, and Smyth 1996b). It has many steps, beginning with preprocessing the data, then transforming it, and finally applying data mining techniques in order to find patterns that can be interpreted as knowledge.
Compared to more traditional analysis techniques, data mining is about finding novel and unexpected information from the data rather than confirming existing hypotheses (Hand, Mannila, and Smyth 2001). KDD's potential to utilize the large datasets that are now common in the field of physical activity makes it a natural framework for knowledge discovery from such sources.
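
The three KDD steps named above (preprocessing, transformation, data mining) can be illustrated with a small Python sketch. This is only a toy example under stated assumptions and not the implementation used in the thesis: the data values, the two variables (mean counts per minute and daily sedentary minutes), the helper names preprocess, transform and kmeans, and the choice of min-max scaling are all hypothetical. Plain k-means stands in for the data mining step because the abstract names it as the clustering method tested.

```python
import math
import random


def preprocess(rows):
    """Preprocessing: drop records that contain missing values (None)."""
    return [r for r in rows if all(v is not None for v in r)]


def transform(rows):
    """Transformation: min-max scale every variable to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(r, lo, hi)]
        for r in rows
    ]


def kmeans(points, k, iters=50, seed=0):
    """Data mining: plain k-means with randomly chosen initial prototypes."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest prototype (Euclidean distance).
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Recompute each prototype as the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep prototype of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical records: (mean counts per minute, sedentary minutes per day).
raw = [
    (200, 600), (220, 580), (None, 590), (210, 620),  # less active students
    (800, 300), (820, 320), (790, 310),               # more active students
]

clean = preprocess(raw)       # the record with a missing value is removed
scaled = transform(clean)     # both variables now share the same scale
labels, centers = kmeans(scaled, k=2)
print(labels)
```

Scaling before clustering matters here because k-means relies on Euclidean distance, so an unscaled variable with a large range would dominate the result. Interpreting the resulting groups as knowledge remains a separate step that the sketch does not cover.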
