Yu et al. EURASIP Journal on Advances in Signal Processing (2019) 2019:14
https://doi.org/10.1186/s13634-019-0612-x

RESEARCH | Open Access

A multisource fusion framework driven by user-defined knowledge for egocentric activity recognition

Haibin Yu, Wenyan Jia, Zhen Li, Feixiang Gong, Ding Yuan, Hong Zhang and Mingui Sun*

*Correspondence: [email protected]. Department of Neurological Surgery, University of Pittsburgh, Pittsburgh, USA. Full list of author information is available at the end of the article.

© The Author(s). 2019 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Abstract
Recently, egocentric activity recognition has attracted considerable attention in the pattern recognition and artificial intelligence communities because of its widespread applicability to human systems, including the evaluation of dietary and physical activity and the monitoring of patients and older adults. In this paper, we present a knowledge-driven multisource fusion framework for the recognition of egocentric activities in daily living (ADL). This framework employs Dezert–Smarandache theory across three information sources: the wearer's knowledge, images acquired by a wearable camera, and sensor data from wearable inertial measurement units and GPS. A simple likelihood table is designed to provide routine ADL information for each individual. A well-trained convolutional neural network is then used to produce a set of textual tags that, along with routine information and other sensor data, are used to recognize ADLs based on information theory-based statistics and a support vector machine. Our experiments show that the proposed method accurately recognizes 15 predefined ADL classes, including a variety of sedentary activities that have previously been difficult to recognize. When applied to real-life data recorded using a self-constructed wearable device, our method outperforms previous approaches, and an average accuracy of 85.4% is achieved for the 15 ADLs.

Keywords: Egocentric activity recognition, Activity of daily living, Multisource fusion, Knowledge-driven model, Dezert–Smarandache theory

1 Introduction
In recent years, a variety of camera-based smart wearable devices have emerged in addition to smart watches and wristbands, such as Google Glass, Microsoft SenseCam, and Narrative. These wearables usually contain not only a camera, but also other sensors such as inertial measurement units (IMUs), global positioning system (GPS), temperature sensors, light sensors, barometers, and physiological sensors. These sensors automatically collect video/image, motion/orientation, environmental, and health data. Because these data are collected from the viewpoint of the wearer, they are called egocentric or first-person data. Tools for the automatic analysis and interpretation of egocentric data have been developed and applied to healthcare [1, 2], rehabilitation [3], smart homes/offices [4], sports [5], and security monitoring [6]. Egocentric activity recognition has now become a major topic of research in the fields of pattern recognition and artificial intelligence [7, 8].

Traditional methods of egocentric activity recognition often utilize motion sensor data from the IMU only and process these data using conventional classification techniques [9]. However, the performance of motion-based methods depends on the location of the IMU sensor on the body, and the classification accuracy tends to be lower when used to distinguish more complex activities in daily living (ADL), especially certain sedentary activities. A wearable camera can provide more ADL information than motion sensors alone. Therefore, vision-based activity recognition using a wearable camera has become the focus of research in the field of egocentric activity recognition [10, 11].
In recent years, with the continuous development of the deep learning framework, the accuracy of image/video recognition has improved greatly, and numerous vision-based activity recognition methods, such as deep learning, have emerged [12–14]. It has been reported that deep learning achieved a performance improvement of roughly 10% over traditional trajectory tracking methods [14]. Although there has been significant progress in egocentric ADL recognition, the performance of vision-based methods is still subject to a number of constraints, such as the location of the wearable camera on the human body, image quality, variations in lighting conditions, occlusion, and illumination. In practical applications, no single sensor can be applied under all possible conditions. A common practice to avoid the risk of misrecognition by a single sensor is to fuse multiple recognition results for the same target from different sensors. Therefore, efforts have been made to combine vision and other sensor data for egocentric ADL recognition. For example, egocentric video and IMU data captured synchronously by Google Glass were used to recognize a number of ADL events [15]. Multiple streams of data were processed using convolutional neural networks (CNNs) and long short-term memory (LSTM), and the results were fused by maximum pooling. The average accuracy for 20 distinct ADLs reached 80.5%, whereas using the video and sensor data individually only yielded accuracies of 75% and 49.5%, respectively.
In [16], the dense trajectories of egocentric videos and temporally enhanced trajectory-like features of sensor data were extracted separately and then fused using the multimodal Fisher vector approach. The average recognition accuracy after fusion was 83.7%, compared to 78.4% for video-only and 69.0% for sensor-only data. These results show that, for egocentric ADL recognition, it is beneficial to integrate IMU sensors and cameras at both the hardware and algorithm levels.

Some commonly used multisource fusion methods include Bayesian reasoning, fuzzy-set reasoning, expert systems, and evidence theory, comprising Dempster–Shafer evidence theory (DST) [17] and Dezert–Smarandache theory (DSmT) [18]. Among these methods, DST and DSmT have a simple form of reasoning and can represent imprecise and uncertain information using basic belief assignment functions, thus mimicking human thinking in uncertainty reasoning. By generalizing the discernment framework and proportionally redistributing the conflicting beliefs, DSmT usually outperforms DST when dealing with multisource fusion cases involving conflicting evidence sources.

In egocentric ADL recognition using evidence theory, an activity model is often required to convert the activity data or features from different sources into the basic belief assignment (BBA). Generally, activity models can be divided into two types: data-driven and knowledge-driven [19]. Most ADLs have certain regularities because they occur at a relatively fixed time and place and interact with a fixed combination of objects. As a result, abundant information about when, where, and how ADLs occur can be used to establish a knowledge base. Therefore, for ADL recognition, the knowledge-driven model is more intuitive and potentially powerful. Although no special knowledge-driven model for egocentric ADL recognition currently exists, some knowledge-driven models have been established in fields such as ADL recognition in smart homes, e.g., the descriptive logic model [20], event calculus model [21], and activity ontology model [22]. Although these models offer semantic clarity and logical simplicity, they are usually complex. Users must contact the developers to convert their own daily routines into model parameters. Considering that this kind of model is best created by the wearers themselves, the current methods for knowledge representation require substantial simplification to improve their usability and adaptability for egocentric ADL recognition.

In this study, we propose a new knowledge-driven multisource fusion framework for egocentric ADL recognition and apply it to egocentric image sequences and other sensor data captured by a self-developed chest-worn device (eButton) [23] for diet and physical activity assessment. The main contributions of this study are as follows:

(1) A knowledge-driven multisource fusion framework based on DSmT is established for the fusion of prior knowledge, vision-based results, and sensor-based results. This framework enables the accurate recognition of up to 15 kinds of ADLs, including a variety of sedentary activities that are hard to recognize using traditional motion-based methods, e.g., computer use, meetings, reading, telephone use, watching television, and writing.

(2) The proposed knowledge-driven ADL model can be established by the device user. Previously, users were required to consult with an expert who could represent the user's life experience quantitatively using certain index values. Our framework simplifies this process significantly, allowing individuals to express their ADL routines using a set of simple association tables.

(3) A novel activity recognition algorithm based on egocentric images is proposed. With the help of "bags of tags" determined by CNN-based automatic image annotation, the complex image classification task is reduced to a text classification problem. Furthermore, the entropy-based term frequency-inverse document frequency (TF-IDF) algorithm is used to perform feature extraction and ADL recognition.

The remainder of this paper is organized as follows. Our methods for ADL recognition are described in detail in Section 2. A series of experimental results demonstrating the performance of the proposed framework are presented in Section 3. The comparison with existing methods is shown in Section 4. Finally, we conclude this paper in Section 5 by summarizing our approach and results and discussing some directions for future research.
2 Methods
Our multisource ADL recognition method is illustrated in Fig. 1. Conceptually, it consists of four main components: (1) basic information about the ADL routines of an individual (the user of the wearable device) is acquired using a "condition–activity" association table, (2) a CNN-based automatic image annotation pre-classifies the textual results using an entropy representation, (3) a set of motion and GPS data is processed and pre-classified using a support vector machine (SVM), and (4) a final classification is performed analytically by fusing the pre-classified results, represented in terms of BBAs, based on the DSmT framework.

Fig. 1 Architecture of the proposed method

2.1 BBA of user knowledge
It is widely accepted that "the person who knows you the best is yourself," although this is not universally true (e.g., a doctor may know better regarding illnesses). Nevertheless, people know their own lifestyle and ADL routines far better than other people or a computer. Therefore, we develop a knowledge-driven ADL model that can be established by the user of a wearable device. Previously, such a model would require the person to consult an expert who represents the user's life experience quantitatively using certain index values [20–22]. In our framework, we simplify this process significantly to allow individuals to express their ADL routines using a set of simple association tables.

Let us consider r sources of information ε_1, ε_2, …, ε_r. As each source may contain multiple information entities, each source ε_i is represented as a vector. With this definition, we represent pairwise relationships (ε_i, ε_j) from the r sources as a rectangular matrix. The matrix entry in row ε_i and column ε_j expresses the strength (a positive number) of the relation between these two elements. As the relationship between the two elements is not commutative, i.e., A leads to B does not imply B leads to A, the relationship matrix for (ε_i, ε_j) is generally asymmetric. As an important special case, (ε_i, ε_j) for i = j represents the relationships among the elements of ε_i. According to Žitnik and Zupan [24], all (ε_i, ε_j) can be tiled into a large, sparse global matrix.

As our knowledge-driven model runs under the framework of the Dezert–Smarandache theory, all activity-related conditions (e.g., time, place, order of occurrence) must be specified through the construction of numerical BBAs. Thus, if we view the ADLs and the conditions as different information sources, we can use the above theoretical framework to represent ADLs in relationships with certain conditions, including their time, place, and order of occurrence, and then fill the pairwise matrices (or tables) numerically. In our application, we require a simple and intuitive form that can be used by individuals. Therefore, we design each matrix as an association table containing integer values from 0 (impossible) to 10 (assured). For example, to represent one's ADLs at different clock times, a hypothetical individual's time–activity table is presented in Table 1. In this table, the wearer can adjust the time periods according to his/her daily routine, especially for activities with relatively clear start times, such as getting up, starting work, leaving work, and sleeping. Multiple time–activity tables may be required for weekdays and weekends/holidays (see the examples in Appendixes 1 and 2). Similarly, a location–activity table and an activity transition table (i.e., a table specifying the previous activity and the current activity) can be designed to further enrich the knowledge-driven model. Our experiments indicate that such tables can be completed quickly with little training.

Table 1 Sample time–activity table
Time period   Cleaning  Computer use  Eating  Entertainment  Lying down  Meeting  Reading  ...*  Watching TV  Writing
0:01–6:50        0          0            0          0            10          0        0     ...       0          0
6:51–7:20        2          2           10          0             0          0        3     ...       0          1
7:21–7:30        0          0            0          0             0          0        0     ...       0          0
...             ...        ...          ...        ...           ...        ...      ...    ...      ...        ...
21:01–22:00      2         10            1          0             0          0        3     ...       9          2
22:01–00:00      1         10            2          0             5          0        0     ...       9          0
*Six columns (indicated by "...") are omitted from the table, namely "shopping," "talking," "telephone use," "transportation," "walking outside," and "washing up"

Considering that the BBA value for each activity should be between 0 and 1 (see Section 2.4), we apply row-wise normalization according to the sum of all integer values in that row. For the example in Table 2, if the clock time is 21:18:00, the corresponding BBA is constructed by dividing all integer values in the "21:01–22:00" row by the sum of these values.

Table 2 BBA values of the user-provided knowledge of ADLs, based on Table 1 and a timestamp of 21:18:00
Time period   Cleaning  Computer use  Eating  Entertainment  Lying down  Meeting  Reading  ...*  Watching TV  Writing
21:01–22:00    0.0488      0.2439      0.0244       0             0          0      0.0732  ...     0.2195     0.0488
*Six columns (indicated by "...") are omitted from the table, namely "shopping," "talking," "telephone use," "transportation," "walking outside," and "washing up"
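As a concrete illustration of this normalization step, the following minimal Python sketch converts one row of a time–activity table into a knowledge BBA. It is not the authors' implementation; the values chosen for the six columns omitted from Table 1 are hypothetical and are picked only so that the row total (41) reproduces the numbers shown in Table 2.

```python
# Minimal sketch: turn one row of a user-defined time-activity table into a BBA
# by row-wise normalization (Section 2.1). The visible entries follow the
# "21:01-22:00" row of Table 1; the six omitted columns are filled with
# hypothetical values so that the row sums to 41, matching Table 2.

ACTIVITIES = ["cleaning", "computer use", "eating", "entertainment", "lying down",
              "meeting", "reading", "shopping", "talking", "telephone use",
              "transportation", "walking outside", "washing up", "watching TV", "writing"]

# Entries are integers from 0 (impossible) to 10 (assured).
row_21_22 = {"cleaning": 2, "computer use": 10, "eating": 1, "reading": 3,
             "watching TV": 9, "writing": 2}                      # visible columns of Table 1
row_21_22.update({"talking": 6, "telephone use": 5, "washing up": 3})  # hypothetical hidden columns

def knowledge_bba(row):
    """Row-wise normalization of the 0-10 likelihood scores into a BBA (sums to 1)."""
    total = sum(row.get(a, 0) for a in ACTIVITIES)
    return {a: row.get(a, 0) / total for a in ACTIVITIES}

bba_k = knowledge_bba(row_21_22)
print(round(bba_k["computer use"], 4))   # 10 / 41 = 0.2439, as in Table 2
```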
2.2 BBA of images
In our case, activity recognition from egocentric images must be performed indirectly, because the person wearing the camera is unlikely to appear in the images. We perform the recognition task using the concept of a combination of objects (CoO) [25, 26]. For example, "computer use" is likely to have a CoO consisting of a computer, monitor, screen, keyboard, and table. When this CoO is fully or partially observed, the underlying activity can be guessed with a certain degree of confidence. In this study, the two main steps for ADL recognition using the CoO concept are (1) extraction of the CoO and (2) construction of an ADL classifier. These steps are detailed below.

2.2.1 Semantic feature extraction by CNN
In this study, we are mainly concerned with whether ADL-related objects are present in the input image, rather than their order of presentation (although the order may also carry some information). Ignoring the order, we perform the CoO detection task in two steps. In the first step, all objects in the input image are detected and represented in the form of a textual list. This is essentially a process of automatic image annotation. In the second step, we check whether there is a CoO corresponding to a particular ADL in the list.

Recently, with the continuous development of the deep learning framework, automatic image annotation can produce impressive results with the aid of well-trained CNNs. A CNN is a class of deep, feed-forward artificial neural networks that generally includes convolutional layers, pooling layers, and fully connected layers. Some well-known pre-trained CNNs include AlexNet [27], VGGNet [28], and ClarifaiNet [29, 30], which are pre-trained using a large image database such as ImageNet [31]. The typical process of automatic image classification and annotation using a pre-trained CNN is shown in Fig. 2 (considering the VGG-16 network in VGGNet as an example). The output of the automatic image annotation is a series of textual tags, which can be defined as a "bag of tags" (BoTs). As the BoTs are extracted from a specific image, they can be regarded as the high-level semantic feature of that image.

Fig. 2 The typical process of automatic image classification and annotation using a pre-trained CNN
After comparison, we find that the textual tags extracted by ClarifaiNet are more consistent with the objects in the images of our egocentric dataset. Therefore, we use ClarifaiNet and adopt the process exemplified in Fig. 2 to obtain the BoTs of each frame in the egocentric image sequence, i.e.,

$\mathrm{BoTs}_i = \mathrm{CNN}_{\mathrm{ClarifaiNet}}(I_i) = \left\{ T^i_1, T^i_2, \ldots, T^i_L \right\}$    (1)

where I_i is the ith frame in the image sequence, T is the extracted tag, and L is the number of tags extracted from one frame of the image (when using ClarifaiNet, the default value of L is 20). An example of BoTs is shown in Table 3, and the images corresponding to these BoTs are shown in Fig. 3.

Table 3 BoTs produced by ClarifaiNet for the egocentric images in Fig. 3
Image no.   BoTs (first seven tags)
a   Computer, Technology, Business, Laptop, People, Indoors, Keyboard, ...
b   Computer, Technology, Keyboard, Internet, Laptop, Business, Electronics, ...
c   Room, No person, Table, Business, Computer, Indoors, Office, ...
d   Computer, Technology, Laptop, Internet, No person, Business, Keyboard, ...
e   Food, People, Knife, Indoors, Meat, Restaurant, Cooking
f   Food, No person, Meat, Fish, Dinner, Meal, Plate, ...
g   Food, Indoors, People, Knife, Sugar, Fruit, Cooking, ...
h   People, Indoors, Container, Drink, Food, Table, Tableware, ...

Fig. 3 Examples of egocentric images of different activities. a–d are egocentric images of "computer use"; e–h are egocentric images of "eating"
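The tags themselves come from ClarifaiNet, whose interface is not reproduced in this excerpt. As a rough stand-in for Eq. (1), the sketch below uses a pre-trained torchvision network and its ImageNet label set to produce a top-L tag list per frame; the model choice, the file name, and the reuse of ImageNet labels as the tag vocabulary are assumptions made purely for illustration and do not reflect the paper's actual tagger.

```python
# Minimal sketch of Eq. (1): turn one egocentric frame into a "bag of tags" (BoTs).
# Stand-in only: the paper uses ClarifaiNet; here a pre-trained torchvision
# ResNet-50 and its ImageNet label vocabulary play the role of the tagging CNN.
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()            # resize / crop / normalize
categories = weights.meta["categories"]      # textual label vocabulary

def bag_of_tags(image_path, L=20):
    """Return the top-L textual tags for one frame (BoTs_i in Eq. (1))."""
    img = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        scores = model(preprocess(img).unsqueeze(0)).softmax(dim=1)[0]
    top = torch.topk(scores, k=L).indices
    return [categories[i] for i in top]

# Example: tags = bag_of_tags("frame_0001.jpg")   # hypothetical file name
```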
Note that (2) does not apply to two operands are equal,” resulting in a binary output for the casewherethedocument setcontains differenttypes the bracketed variable. Using (4), CðB ;T Þ can be Ak i of documents, i.e., it cannot be used directly to classify a expressedas BoT set containing different ADLs. To apply TF-IDF to document sets containing multiple types of documents, a XjAkj (cid:5) (cid:6) CðB ;T Þ¼ C B ;T : ð5Þ number of modified algorithms have been developed, in- Ak i j i cludingbidirectionalnormalizationforthetermfrequency j¼1 [34], constraints imposed by the mutual information [35], The intra-class entropy of T for B , called e2, can be i A i and the application of information entropy [36]. The defined as entropy-based TF-IDFgenerally provides betterclassifica- tion because the statistical features of the terms among XK DðB ;T Þ DðB ;T Þ e2 ¼− Ak i (cid:2) log Ak i ð6Þ different types of documents can be well-represented by i DðB ;T Þ 2 DðB ;T Þ theinformationentropy.Wemodifytheentropyapproach k¼1 A i A i byaddinganinter-classentropy factor e1i, kand anintra- where DðBAk;TiÞ is the number of BoTs containing tag fcilearsstoen“tcroompypafacct”totrhee2iinttora(-2c)l.asTshaisctaivlliotiwess wthheilBeo“Tsecplaarsasti-- TiinsubsetBAk,d(cid:4)(cid:1)efined as (cid:3)(cid:4) ing”inter-classactivities,asdescribedbelow. DðBAk;TiÞ¼(cid:4) j:Ti∈Bj; Bj∈BAk (cid:4): ð7Þ Assuming that the total number of the ADLs to be From this definition of DðB ;T Þ, we can express classified is K, the corresponding egocentric image set is Ak i A={A ,A , …,A }. For the kth activity A ∈A, the total D(BA,Ti)as 1 2 K k number of images is |Ak| and all BoTs extracted from XK each image in Ak constitute the BoT subset BAk ¼fB1; DðBA;TiÞ¼ DðBAk;TiÞ B2;…;BjAkjg. For the BoTset of A, we thePn have BA ¼f Xk¼K1(cid:4)(cid:1) (cid:3)(cid:4) BsuAm1;eBAth2;a…t t;hBeArke;…ar;eBNAK−u1n;BiqAuKegtwagitshTjA=j{¼T ,TKk¼,1…jA,kTj.}Aisn- ¼ (cid:4) j:Ti∈Bj; Bj∈BAk (cid:4): ð8Þ 1 2 N k¼1 B . For any tag T ∈T, its inter-class entropy factor for A i It can be observed from (3) that e1 is used to B ,callede1 ,canbedefined as i, k Ak i,k describe the distribution of tag T in B , which cor- i Ak (cid:5) (cid:6) (cid:5) (cid:6) responds to the particular activity Ak. Moreover, the XjAkj C B ;T C B ;T more uniform the distribution of T in B , the lar- e1i;k ¼− j¼1CðBAjk;TiiÞ(cid:2) log2CðBAjk;TiiÞ ger the value of e1i, k and, consequiently, Atkhe greater the contribution of the T to the classification of ac- i ð3Þ tivity A . Similarly, in (6), e2 is used to describe the k i distribution of tag T across the BoT subsets in B , i A where C(B,T) is the number of occurrences of tag T in which corresponds to all different activities. 
When j i i B (i.e.,the jthsubsetofB ),given by e2 reaches its maximum, however, the T are j Ak i i Yuetal.EURASIPJournalonAdvancesinSignalProcessing (2019) 2019:14 Page7of23 Table4ExampleoftheBoTclassifier Activity ζ withtheentropy-basedTF-IDFvalue k 1 2 3 4 5 6 … Computeruse Keyboard Monitor Screen Internet Electronics Laptop … 0.4328 0.3792 0.3255 0.3127 0.3071 0.2662 … Eating Food Drink Restaurant Dinner Cooking Bowl … 0.3678 0.3286 0.3240 0.2894 0.2594 0.2361 … Shopping Stock Market Shopping Shop Merchandise Supermarket … 0.4216 0.4185 0.4079 0.3363 0.2724 0.2373 … Washingup Bathroom Wash Bath Hygiene Faucet Bathtub … 0.4955 0.4375 0.2879 0.2859 0.2789 0.2699 Transportation Dashboard Steeringwheel Control Fast Drive Driver … 0.2769 0.2769 0.2733 0.2716 0.2696 0.2669 … … … … … … … … uniformly distributed among the BoT subsets in B , to form the BBA for images; an example of this can A which means that T has no ability to distinguish dif- be seen in the third row (BBA of image) of Table 6. i ferent activities. Therefore, the value of e2 is in- i versely proportional to its contribution to the classification, which is the opposite of e1 . Balan- 2.3BBAofIMUandGPSsensors i, k cing these two effects, the entropy-based TF-IDF is For IMU sensors, the output data are multiple 1-D given by waveforms that can be processed using traditional pat- tern recognition methods [9]. First, the data are divided tfi;k (cid:2)idfi(cid:2)e1i;k (cid:2)Rðe2iÞ¼tf(cid:9)i;k (cid:2)idfi(cid:2)e1i;k (cid:10) into non-overlapping segments, and the structural and e2 statistical features of each segment are extracted. These (cid:2) 1− i ð9Þ log K þλ features are used to train a classifier. The training ends 2 when acertainstoppingcriterion ismet. where R(e2)=1−e2/(log K+λ) is used to remap e2 IMU sensors include an accelerometer and a gyro- i i 2 i so that its value is proportional to the contribution in scope, each producing three traces of signals in the the classification. The parameter λ is an empirically x-, y-, and z-axes. These signals are divided into 3-s determined small positive constant that guarantees segments without overlapping. To synchronize them R(e2)>0. with the corresponding images, each segment is cen- i Using (9), the BoTclassifier can be obtained by appl- tered around the time stamp in the image data. The ying a suitable training procedure. Specifically, the features extracted in each segment include the mean, entropy-based TF-IDF weight of each tag in the sample standard deviation, correlation, signal range (differ- BoT set is calculated, and the M tags with the highest ence between maximum and minimum), root mean weight values are extracted from B to form the class square, signal magnitude area [37], autoregressive Ak center vector ζ corresponding to activity A . All class coefficients (calculated up to the sixth order), and k k center vectors constitute the BoTclassifier, given by the binned distribution (selected to be 10) [38]. These features are combined with the GPS velocity Classifier ¼fζ ;ζ ;…;ζ ;…;ζ g: ð10Þ and coordinates (if unavailable, the most recent GPS B 1 2 k K data are used) to form 127-dimentional feature vec- An example of the BoT classifier is presented in tors that are fed into a multiclass SVM for training Table4. and classification. 
2.3 BBA of IMU and GPS sensors
For IMU sensors, the output data are multiple 1-D waveforms that can be processed using traditional pattern recognition methods [9]. First, the data are divided into non-overlapping segments, and the structural and statistical features of each segment are extracted. These features are used to train a classifier. The training ends when a certain stopping criterion is met.

The IMU sensors include an accelerometer and a gyroscope, each producing three traces of signals along the x-, y-, and z-axes. These signals are divided into 3-s segments without overlap. To synchronize them with the corresponding images, each segment is centered around the time stamp in the image data. The features extracted from each segment include the mean, standard deviation, correlation, signal range (difference between maximum and minimum), root mean square, signal magnitude area [37], autoregressive coefficients (calculated up to the sixth order), and the binned distribution (with 10 bins) [38]. These features are combined with the GPS velocity and coordinates (if unavailable, the most recent GPS data are used) to form 127-dimensional feature vectors that are fed into a multiclass SVM for training and classification.

The support vector machine (SVM) [39] is a supervised machine learning method widely used in classification and regression analysis. The SVM improves the generalization ability of a learning machine by minimizing the structural risk; hence, it can also yield reasonably good statistical rules for a relatively small sample size. The dual objective function of the SVM can be given by the Lagrangian multiplier method as shown below:

$\max_{\alpha_i \geq 0} \min_{w,b} \mathcal{L}(w, b, \alpha) = \max_{\alpha_i \geq 0} \min_{w,b} \left( \dfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i \left( w^T x_i + b \right) - 1 \right) \right)$    (11)

where x_i is the input data, y_i is the category to which x_i belongs, w is the vector perpendicular to the classification hyperplane, b is the intercept, and α_i is the Lagrange multiplier.

After solving (11) using a quadratic programming algorithm and introducing the kernel function κ(x_1, x_2) = (⟨x_1, x_2⟩ + 1)^2 to map the data to a high-dimensional space, the SVM can perform nonlinear classification according to the following binary prediction:

$g_{\mathrm{SVM}}(x) = \mathrm{sign}\left( w^T x + b \right) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i y_i \kappa(x_i, x) + b \right).$    (12)

Commonly used kernel functions include the polynomial kernel and the radial basis function.

The SVM is fundamentally a two-class classifier; however, it can be extended to multiclass problems by using one-against-one or one-against-all voting schemes. In addition, the basic SVM classifier can only output the classification label. To solve this problem, the "libsvm" toolkit [40], which converts the output of the standard SVM to a posterior probability using a sigmoid-fitting method [41], is utilized. An example is provided in the fourth row (BBA of sensors) of Table 6.
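The sensor pipeline can be sketched as follows. Only a few of the 127 features are shown, and scikit-learn's SVC (a libsvm wrapper with Platt-style probability calibration) is used as a stand-in for the libsvm toolkit cited above; the polynomial kernel settings mirror κ(x_1, x_2) = (⟨x_1, x_2⟩ + 1)^2. This is a minimal illustration, not the authors' implementation.

```python
# Minimal sketch of Section 2.3: a subset of the per-segment IMU features and an
# SVM that outputs class posteriors (used as the BBA of the sensors).
import numpy as np
from sklearn.svm import SVC

def segment_features(acc, gyro):
    """acc, gyro: arrays of shape (n_samples, 3) for one 3-s window (~270 samples at 90 Hz)."""
    feats = []
    for sig in (acc, gyro):
        feats += list(sig.mean(axis=0))                     # mean per axis
        feats += list(sig.std(axis=0))                      # standard deviation
        feats += list(sig.max(axis=0) - sig.min(axis=0))    # signal range
        feats += list(np.sqrt((sig ** 2).mean(axis=0)))     # root mean square
    return np.array(feats)   # the paper also adds correlation, SMA, AR coefficients,
                             # binned distributions, and GPS velocity/coordinates

def train_sensor_classifier(X, y):
    """X: (n_segments, n_features) feature matrix; y: activity labels."""
    clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, probability=True)
    clf.fit(X, y)
    return clf

# BBA of sensors for one new segment: posterior probabilities over the 15 ADLs
# bba_s = dict(zip(clf.classes_, clf.predict_proba(x_new.reshape(1, -1))[0]))
```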
2.4 Hierarchical fusion of knowledge, image, and sensor data by DSmT
In DSmT, the discernment framework Θ = {θ_1, θ_2, ..., θ_n} is extended from the power set 2^Θ of Dempster–Shafer theory to the hyper-power set. The hyper-power set, denoted by D^Θ, admits the intersections of elements on the basis of the power set. For example, if there are two elements in the discernment framework Θ = {θ_1, θ_2}, the power set is 2^Θ = {∅, θ_1, θ_2, θ_1 ∪ θ_2} and the hyper-power set is D^Θ = {∅, θ_1, θ_2, θ_1 ∪ θ_2, θ_1 ∩ θ_2}. The BBA defined on the hyper-power set D^Θ is

$m(X_i): D^\Theta \rightarrow [0, 1],\ X_i \in D^\Theta, \quad m(\emptyset) = 0, \quad \sum_{\theta \in D^\Theta} m(\theta) = 1.$    (13)

The combination rule is the core of evidence theory. It combines the BBAs of different sources within the same discernment framework to produce a new belief assignment as the output. In the DSmT framework, the most widely used combination rule is the Proportional Conflict Redistribution (PCR) rule. There are six PCR rules (PCR1–PCR6), defined in [18]; their differences lie mainly in the method of proportional redistribution of the conflicting beliefs. Among these rules, PCR5 is widely used to combine two sources and PCR6 is usually applied to more than two sources. In particular, PCR6 is the same as PCR5 when there are exactly two sources. If s represents the number of sources, the PCR5/PCR6 combination rule for s = 2 is

$m^{\mathrm{PCR5/PCR6}}_{1 \oplus 2}(A) = \sum_{\substack{X_1, X_2 \in D^\Theta \\ X_1 \cap X_2 = A}} m_1(X_1)\, m_2(X_2) + \sum_{\substack{X \in D^\Theta \\ X \cap A = \emptyset}} \left[ \dfrac{m_1^2(A)\, m_2(X)}{m_1(A) + m_2(X)} + \dfrac{m_2^2(A)\, m_1(X)}{m_2(A) + m_1(X)} \right]$    (14)

where m_{1⊕2} denotes m_1 ⊕ m_2, i.e., sources 1 and 2 are used for evidence fusion of the focal element A in the discernment framework D^Θ. The PCR6 combination rule for s > 2 is

$m^{\mathrm{PCR6}}_{1 \oplus 2 \oplus \cdots \oplus s}(A) = \sum_{\substack{X_1, X_2, \ldots, X_s \in D^\Theta \\ \cap_{i=1}^{s} X_i = A}} \prod_{i=1}^{s} m_i(X_i) + \sum_{\substack{X_1, X_2, \ldots, X_{s-1} \in D^\Theta \\ X_i \neq A,\ i \in \{1, 2, \ldots, s-1\} \\ \left( \cap_{j=1}^{s-1} X_j \right) \cap A = \emptyset}} \sum_{k=1}^{s-1} \sum_{(i_1, i_2, \ldots, i_s) \in P(1, 2, \ldots, s)} \left[ \left( \sum_{p=1}^{k} m_{i_p}(A) \right) \cdot \dfrac{\prod_{j=1}^{k} m_{i_j}(A) \prod_{p=k+1}^{s-1} m_{i_p}(X_p)}{\sum_{j=1}^{k} m_{i_j}(A) + \sum_{p=k+1}^{s-1} m_{i_p}(X_p)} \right]$    (15)

where P(1, 2, ..., s) is the set of all permutations of the elements {1, 2, ..., s}.
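For the special case relevant here, where every source assigns mass only to the 15 singleton activities (so any two distinct focal elements are totally conflicting), the PCR5/PCR6 rules above reduce to the sketch below. It is an illustrative reading of Eqs. (14) and (15), not the authors' code, and it ignores compound hypotheses in D^Θ.

```python
# Minimal sketch of PCR5/PCR6 (Eqs. (14)-(15)) restricted to singleton (Bayesian)
# BBAs over the activity set. Works for two or more sources; for s = 2 it
# reduces exactly to Eq. (14).
from itertools import product

def pcr6_singletons(bbas):
    """bbas: list of dicts {activity: mass}, each summing to 1. Returns the fused dict."""
    activities = list(bbas[0])
    fused = {a: 0.0 for a in activities}
    for combo in product(activities, repeat=len(bbas)):   # one focal element per source
        mass = 1.0
        for m, x in zip(bbas, combo):
            mass *= m[x]
        if mass == 0.0:
            continue
        if len(set(combo)) == 1:                 # all sources agree: conjunctive part
            fused[combo[0]] += mass
        else:                                    # conflict: redistribute proportionally
            denom = sum(m[x] for m, x in zip(bbas, combo))
            for m, x in zip(bbas, combo):
                fused[x] += mass * m[x] / denom  # each source recovers its own share
    return fused

# Two-source fusion of the knowledge and image BBAs (Fu1 in the algorithm below):
# fu1 = pcr6_singletons([bba_k, bba_v])
```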
In the proposed approach, when DSmT is used for ADL recognition, the discernment framework contains 15 ADLs, as detailed in Eq. (16) and Table 5:

Θ = {A_1, A_2, ..., A_15} = {"cleaning," "computer use," "eating," "entertainment," "lying down," "meeting," "reading," "shopping," "talking," "telephone use," "transportation," "walking outside," "washing up," "watching TV," "writing"}    (16)

Table 5 The description of the discernment framework defined in Eq. (16)
1 Cleaning (CN)        6 Meeting (MT)          11 Transportation (TP)
2 Computer use (CU)    7 Reading (RD)          12 Walking outside (WO)
3 Eating (ET)          8 Shopping (SP)         13 Washing up (WU)
4 Entertainment (EM)   9 Talking (TK)          14 Watching TV (TV)
5 Lying down (LD)      10 Telephone use (TU)   15 Writing (WT)

As the total number of sources is three (i.e., knowledge, image, and sensor data), PCR6 should be selected as the evidence combination rule if all sources are used in the data fusion process. An example of the fusion result from three sources using (15) is presented in Table 6. In this example, the BBAs of knowledge, image, and sensor data are derived from the time–activity table, the cosine similarity between the current BoT and the class centers, and the posterior probability output by the support vector machine classifier, respectively.

Table 6 Example of three-source fusion using the PCR6 rule
                    CN      CU      ET      EM      LD      MT      RD      SP      TK      TU      TP      WO      WU      TV      WT
BBA of knowledge  0.1860  0.0233  0.2326  0.1163  0       0       0.0233  0       0.1163  0.1860  0       0.0698  0.0233  0.0233  0
BBA of image      0.0401  0.0260  0       0.4452  0       0.1526  0.0610  0       0.0939  0.1505  0       0       0       0       0.0308
BBA of sensors    0.0041  0.0303  0.0078  0.0558  0.0338  0.0076  0.0077  0.0229  0.1781  0.0264  0.0101  0.0174  0.0178  0.5602  0.0200
Fusion result     0.0565  0.0057  0.0754  0.2561  0.0031  0.0427  0.0103  0.0015  0.1017  0.0960  0.0003  0.0106  0.0022  0.3341  0.0037
Conditions: the time stamp of the camera is 17:30:57 on Thursday. The captured image can be seen in Fig. 6(d), and the ground truth is "entertainment." In the original table, italics mark the maximum BBA value within each information source; the corresponding activity is the recognition result of that source.

In our case, the information sources differ greatly in signal type and processing algorithm; e.g., the image source provides a specific combination of objects, whereas the sensor source provides the motion status of the person wearing the device. Hence, the corresponding recognition results are often different. This can be observed in Table 6. For the same activity, the recognition results from the image and sensor sources are "entertainment" and "watching TV," respectively. In fact, "entertainment" (specifically "playing poker" in this case) and "watching TV" are both sedentary activities, and it is difficult to distinguish them using motion sensors (both the IMU and GPS sensors). Therefore, the recognition result from the image source should be more reliable. However, after fusion, the final recognition result is "watching TV" because the belief value of "entertainment" assigned by the BBA of the sensors is very low.

Based on previous research [15, 16] and our own study (described in Section 3), most ADLs achieve significantly higher accuracy when using vision-based data than with motion sensor-based data. Thus, in many cases, if the three sources of information are fused directly, the accuracy of the output is often affected by the low specificity of the motion sensors. However, we still need to use motion sensors to identify ADLs that have significant motion signatures, such as "cleaning," "walking outside," and "lying down." Therefore, considering the reliability of each information source, we treat the user knowledge and image sources as high-priority data and the motion sensor source as low-priority data, i.e., we supplement the sensor information only when the fusion of the user knowledge and image sources leads to a conflict.

We implement this source-priority concept using a two-level hierarchical fusion network with descending candidate sets (2-L HFNDCS, see Fig. 4), similar to the implementation strategy proposed in [42, 43]. When the two-source fusion between the knowledge- and image-based methods produces a conflicting result, motion sensor data are added to the pool of evidence for a second-level three-source fusion. Instead of considering all activities, only the candidate activities identified by the two-source fusion are used as the input for the three-source fusion. The initial number of candidate activities is given in advance, and this number can be adjusted according to subsequent test results. The output of the final fusion is the activity with the highest belief among the candidate activities. The 2-L HFNDCS algorithm can be described as follows.

Fig. 4 Architecture of 2-L HFNDCS

Algorithm of 2-L HFNDCS
Input: BBA of knowledge (BBA_k), BBA of vision (BBA_v), BBA of sensors (BBA_s), number of candidates (Nc)
Output: Activity recognition result (Ax)
(1) Compute Fu1 = PCR6(BBA_k, BBA_v) by the two-source PCR6 combination rule in (14)
(2) Let Max_pos(.) denote the position of the maximum in a matrix
(3) If Max_pos(Fu1) == Max_pos(BBA_v)   // no conflict
(4)     Ax = Max_pos(Fu1)
(5) Else
(6)     Sort Fu1 and obtain the positions of the first Nc maxima, i.e., Max_pos(Fu1, Nc)
(7)     Compute Fu2 = PCR6(BBA_k, BBA_v, BBA_s) using the three-source PCR6 combination rule in (15)
(8)     Ax = Max_pos(Fu2(Max_pos(Fu1, Nc)))
(9) End
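A minimal Python rendering of this listing is sketched below. It assumes singleton BBAs stored as {activity: mass} dictionaries and takes the fusion rule as a parameter (for example, the pcr6_singletons function sketched after Eq. (15)); the default candidate count is an arbitrary placeholder, since the paper states only that Nc is chosen in advance.

```python
# Minimal sketch of the 2-L HFNDCS algorithm above.
def hfndcs(bba_k, bba_v, bba_s, fuse, nc=3):
    """Two-level hierarchical fusion with descending candidate sets.

    bba_k, bba_v, bba_s: {activity: mass} dicts for knowledge, vision, sensors.
    fuse: callable mapping a list of BBAs to a fused BBA (e.g., pcr6_singletons).
    nc: number of candidates kept after the first-level fusion (placeholder value).
    """
    best = lambda d: max(d, key=d.get)
    fu1 = fuse([bba_k, bba_v])                    # step (1): knowledge + vision
    if best(fu1) == best(bba_v):                  # steps (3)-(4): no conflict with vision
        return best(fu1)
    candidates = sorted(fu1, key=fu1.get, reverse=True)[:nc]     # step (6)
    fu2 = fuse([bba_k, bba_v, bba_s])             # step (7): second level adds the sensors
    return max(candidates, key=lambda a: fu2[a])  # step (8): highest belief among candidates

# Example (with the helpers from the earlier sketches):
# activity = hfndcs(bba_k, bba_v, bba_s, fuse=pcr6_singletons)
```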
power, the camera acquires one image every 4s. The To reduce this data imbalance, a key frame extraction built-in IMU contains a three-axis accelerometer and a method was used [44, 45]. As the two eButton wearers three-axis gyroscope with a sampling frequency of 90 each participated in the study for about 3 months, we Hz. The GPS data are acquired from the wearer’s mobile had sufficient data to form two independent datasets phone at 1-s intervals and synchronized with the eBut- (one for training and one for testing). We combined ton datausingtimestamps. these data to form an egocentric activity dataset, called Two volunteers with regular daily routines and rela- the eButton activity dataset [47]. tively invariant living environments were selected for In the eButton activity dataset, each wearer (referred our experiments. After signing a consent form ap- to as W1 and W2) has a separate set of time–activity ta- proved by the Institutional Review Board, they were bles, a training set, and a test set. Although the training asked to fill out the time–activity table described above. set and the test set do not overlap, they both have the Their time–activity tables are provided in Appendixes 1 same structure: a subset of egocentric images, a subset Fig.5AppearanceoftheeButtonandexamplesofitswearingmethods
