Sign Language Recognition using Temporal Classification

Hardie Cate ([email protected])
Fahim Dalvi ([email protected])
Zeshan Hussain ([email protected])

December 11, 2015

arXiv:1701.01875v1 [cs.CV] 7 Jan 2017

1 Introduction

In the US alone, there are approximately 900,000 hearing-impaired people whose primary mode of conversation is sign language. For these people, communication with non-signers is a daily struggle, and they are often disadvantaged when it comes to finding a job, accessing health care, etc. There are a few emerging technologies aimed at overcoming these communication barriers, but most existing solutions rely on cameras to translate sign language into vocal language. While these solutions are promising, they require the hearing-impaired person to carry the technology with him/her or for a proper environment to be set up for translation.

One alternative is to move the technology onto the person's body. Devices like the Myo armband available in the market today enable us to collect data about the position of the user's hands and fingers over time. Since each sign is roughly a combination of gestures across time, we can use these technologies for sign language translation. For our project, we utilize a dataset collected by a group at the University of New South Wales, which contains parameters, such as hand position, hand rotation, and finger bend, for 95 unique signs. For each input stream representing a sign, we predict which sign class this stream falls into. We begin by implementing baseline SVM and logistic regression models, which perform reasonably well on high-quality data. Lower quality data requires a more sophisticated approach, so we explore different methods in temporal classification, including long short-term memory architectures and sequential pattern mining methods.

2 Related Work

Several techniques have been used to tackle the problem of sign language to natural language translation. Conventionally, sign language translation involves taking an input of video sequences, extracting motion features that reflect sign language linguistic terms, and then using machine learning or pattern mining techniques on the training data. For example, Ong et al. propose a novel method called Sequential Pattern Mining (SPM) that utilizes tree structures to classify signs [7]. Their data was captured using a mobile camera system, and the motion features extracted included motion of the hands, location of the sign being performed, and the handshapes used. These data are collected temporally (i.e., at multiple time steps). The authors use an SPM method in order to address common issues with other models that have been used for sign language classification and translation, including unrefined feature selection and non-discriminatory learning. In general, there has been significant work done in using sequential pattern mining methods to analyze temporal data [7][8]. Other authors use Hidden Markov Models (HMMs) on features extracted from video frames to perform the same task. Starner et al. hypothesize that the fine movements of a signer's hand are actually not required, and that the coarse position and orientation of the hand are discriminative enough to classify signs [10]. They also try two camera perspectives, one directly in front of the signer and the other on a cap that the person is wearing.

Other techniques that do not involve visual input have also been studied. These techniques normally rely on data from sensors that measure features like hand positions and orientations, finger bend measures, etc. For example, Kadous uses a novel technique in his paper to improve classic machine learning algorithms by creating meta-features [4]. These meta-features are derived from the raw features by looking at important events in the time-series data. An example Kadous uses in his paper is the vertical maximum that a person's wrist reaches while signing. A meta-feature like this gives meaning to the "y-axis" in the data and helps create better features. He also looks at other automatic techniques to generate these meta-features by looking for variations across time in a given dataset. Another study, by Mehdi et al., proposes the use of neural networks to classify signs correctly [6]. They focus on signs that have distinct static shapes rather than analyzing the signs over time.

Some studies have also suggested using HMMs to detect the gestures being performed. Liang et al. employ this technique, along with a Sign Language to English language model, to predict a particular sign [5]. Finally, a study by Graves et al. uses strong classification to predict a sequence of labels given time-series data, rather than a single label [3]. They use a recurrent neural network (RNN) integrated with a softmax classifier to achieve this prediction.

3 Datasets

We are primarily using two datasets from the research project by Kadous [4]. The first dataset is a high quality dataset, with data recorded at a frequency of 200 Hz. This dataset was recorded using two 5-dimensional flock gloves, one on each hand. We have access to 6-bit precision finger bend measures for each finger, as well as 14-bit precision orientation and position of each wrist in space. The second dataset is a low quality dataset, with data available for only one hand. The data was recorded at a much lower frequency of 50 Hz using the Nintendo Powerglove. We only have 2-bit precision for the finger bend measures, and data was recorded for only 4 fingers. The position of the wrist also has a lower precision of 8 bits. In addition, we have access to only one degree of rotation, which was recorded at roughly 4-bit precision. Both datasets consist of time-series data for 95 signs. In the high quality data, we have 27 instances of each sign, while in the low quality data, we have 70 instances of each sign. These instances were recorded across various sessions. The source of the high quality data was a single professional signer, while the low quality data was recorded using several signers of varying levels of proficiency.

3.1 Preprocessing

We perform some of the following preprocessing steps on the dataset, depending on which algorithm we are using:

1. Temporal scaling: The average number of frames for each sign is 57 frames, so we normalize all signs to this length by resampling using a fast Fourier transform. In Figure 1, the graphs on the left display readings for a single motion parameter over the lifetime of the sign, while the corresponding graphs on the right depict the resampled version. In general, we notice that there are not many differences between the original data and the resampled version. Most of the important information is retained, which supports our choice of normalization. This also helps nullify the effects of the speed at which a sign is performed, as some signers sign at a higher speed than others. Temporally scaling the data normalizes all signals to roughly the same speed.

2. Spatial scaling: Each sign in our datasets (especially in the low quality dataset) was performed at different times under different conditions. Hence, the signs were performed at varying relative positions and orientations. For example, a sign that involves moving the wrist in a parabolic motion may have its vertex at varying heights between runs. To normalize these variances, we spatially scale all the signals to be between 0 and 1.

3. Time-series flattening: For our datasets, each sample consists of a NUM_FEATURES x TIME_STEPS matrix, where each row represents one of the distinct motion parameters and each column is a particular frame. We process this data and store it in a 3-D matrix whose dimensions are NUM_EXAMPLES x NUM_FEATURES x TIME_STEPS. For the algorithms that cannot take into account the temporal nature of the data, we transform this matrix into a flattened 2-D matrix of size NUM_EXAMPLES x (NUM_FEATURES x TIME_STEPS), such that the first NUM_FEATURES features in each row are the readings at time t = 1, the next NUM_FEATURES features correspond to time t = 2, etc.

[Figure 1: Signal resampling (raw vs. resampled X, Y, and Z signals)]

4 Technical Approach

4.1 Baseline

We begin with several baseline implementations using SVM and logistic regression models. Since both of these techniques require single-dimensional features, we use temporal scaling and time-series flattening. We use both linear and RBF kernels for our SVM, but the additional complexity of the RBF kernel did not improve our results significantly. For both of these models, we use a one-vs-rest strategy to build classifiers for each sign. We train both of these models on 70% of the data, and use the remaining 30% as test instances for each of the 95 signs. We use the scikit-learn library for our SVM and logistic regression implementation [9].

4.2 Long Short-Term Memory

The first complex model that we work with is a recurrent neural network, specifically the long short-term memory architecture. We choose this model because it takes into account the temporal features of our data, unlike the baseline models. We start out by trying to mimic the results of logistic regression using a simple neural network with a hidden layer that used sigmoid activation. Once we have sufficient performance, we use the same architecture for each time step, and then connect the hidden layers so that we can perform backpropagation through time. Other architectures with more layers (both fully connected and partially connected layers) were also considered. The final architecture is a three-layer network. The first layer is a time-connected layer, the second layer is a fully connected dense layer for each time step, and the final layer is a dense layer that outputs a 95-dimensional vector. We utilize mean squared error as our loss function. We use the Keras library for our LSTM implementation [2].

4.3 Sequential Pattern Mining

The second technique that we try is Sequential Pattern Mining. Batal et al. have described an algorithm to perform multivariate time series classification [1]. Their method is primarily a feature engineering technique that looks at the combination of the signals in an instance and outputs a binary feature vector. We can then train a standard SVM over these binary feature vectors to perform our classification. The algorithm tries to find a set of patterns in the signals that serves as a fingerprint for the class of that signal. After preprocessing (temporal and spatial scaling), SPM primarily has three steps:

1. Discretization: The first step in the algorithm is to discretize the input signals into discrete values. In our implementation, we try two different sets: {high, middle, low} and {very high, high, middle, low, very low}. Because of our spatial scaling, the signal values are between 0 and 1. We set thresholds for each of the discrete values depending on the set we are using. Thus, the discretized signal might look like the following: HMLLHHLMMMMLH, where H is high, M is middle, and L is low. Subsequently, we combine all consecutive values that are equal. Hence, our example would transform into HMLHLMLH.

2. Candidate pattern generation: The next step of the algorithm is generating patterns. A pattern is defined as a list of states, where each pair of consecutive states is connected by a relation. In our case, the states are of the form H:1, which indicates that signal 1 was high. We also consider only two relations, before and overlap. Hence, H:1-b;L:2 implies that the state with a high value in signal 1 occurs before another state in signal 2 that has a low value. Generating candidate patterns alternates between generating all possible patterns of a certain length k and pruning them. We start with all patterns of length 1, i.e. k = 1, then proceed to k = 2, etc. To prune patterns, we see if the pattern appears in any of the instances of a given class. If it appears a minimum number of times (denoted by support), we keep the pattern. Thus, we follow the approach of the Apriori algorithm for frequent item set mining, which relies on the fact that any of a pattern's subpatterns appear at least as often as the pattern itself. Generating patterns of length k+1 involves considering all patterns of length k that have the same k-1 states as their prefix.

3. Binary vector creation: Finally, having a set of patterns that remain after our candidate generation, we use a chi-square test to rank all the patterns by their ability to distinguish between the signs. After ranking, we choose the top MAX_PATTERNS patterns, where MAX_PATTERNS is a hyperparameter we tune. Once we have MAX_PATTERNS patterns, for each instance in our dataset, we build a binary feature vector, where a 1 in position i indicates that pattern i occurs in that instance.

5 Results and Analysis

5.1 Experiments

With each of our models, we try several experiments. Specifically, for our baseline SVM model, we try various kernels. We also try regularizing both our baseline SVM and logistic regression models to prevent overfitting on the data. For the LSTM architecture, we experiment with several activation functions for each layer and try adding intermediate layers like Dropout to prevent overfitting. Finally, for our SPM approach, we have several hyperparameters to tune, such as the window size, the length of patterns generated during candidate generation, and the minimum support required for each candidate pattern. We also tune parameters such as the regularization and the feature vector length on which we train the SVM. Finally, since SPM involves discretizing the signal, we try two different implementations: discretizing the raw signal itself and discretizing the rate of change in the signal.

5.2 Results

The baseline models perform very well on the high quality data. Both the SVM and the logistic regression models give us a test error of 5.8% on the high quality dataset. Since we already have good performance on the high quality dataset, we focus on the low quality dataset for the remainder of the project. The results on the low quality dataset are shown in Table 1.

Table 1: Algorithm performance on the low quality dataset

                    SVM     Log. Reg.   LSTM    SPM
  Precision         0.566   0.444       0.109   0.075
  Recall            0.550   0.444       0.091   0.076
  F1                0.549   0.436       0.066   0.065
  Training error    0.001   0.179       0.852   0.526
  Testing error     0.450   0.556       0.908   0.895

[Figure 2: Confusion matrix for SVM -- (a) high quality, (b) low quality]

5.3 Analysis

First, we note that both our training and testing errors using SVM are very low, so our model is not suffering from overfitting (see Table 1). Additionally, all other metrics, including precision, recall, and F1 score, are very high, suggesting a high, but also precise, level of performance. This theory is substantiated by the confusion matrix for the SVM (see Fig. 2a), which shows that the classifier is not confusing a sign with some other sign. The clear blue line down the diagonal is evidence of this claim. Note that the confusion matrix as well as the metrics on the low quality dataset are much worse than those on the high quality dataset. In general, our classifier confuses similar signs more often, which is expected because there might not have been enough features to distinguish these signs.

Unlike the baseline models, the LSTM results are poor on the low quality dataset. Although we initially expect the LSTM to perform better, as it takes into account the temporal nature of the data, its performance does not change much across the varying architectures of the LSTM that we try. We hypothesize that this is because of some key assumptions that the standard LSTM model makes that do not apply to our data. In the standard LSTM model, backpropagation through time is done at every time step. This is usually acceptable for time series data since we normally want to predict the value at the next consecutive time step. However, in our case, backpropagating at each step leads to a poorer model, since we are not trying to predict the next value in the signal (e.g., the next hand position or orientation), but rather we want to classify the entire signal as one unit. Hence, we would like to only backpropagate at the final time step to achieve a better model. Since building a custom LSTM model would be time consuming, we pursue SPM as it is more promising for our particular task.

The SPM model performs slightly better than the LSTM model. However, with around 10% accuracy, the model is not very strong. For the best result, we had a window size of 2, a minimum support of 20, and a maximum pattern length of 2. We posit that because of such limiting values, the patterns we generate are not very long. Hence, we are losing a lot of distinguishing information by not having longer patterns. Although slightly longer patterns (window size ~10, maximum pattern length ~15) give us poorer performance, we think that using much longer patterns (window size ~20, maximum pattern length ~30) would give us a better result. Unfortunately, our implementation's runtime increases exponentially as we increase these parameters, and hence we were limited in the patterns we could generate. Another hypothesis we have for the generally poor performance is that the SPM algorithm gives us a set of patterns it thinks are most distinguishing. After this, we check if each of the patterns occurs in the instances to build our feature vector, but we do not take into account where or how many times each of these patterns occurs in the instances, thus losing some more information in this process.

To explain the variances in our results between the different models, we decide to analyze the feature space, the importance of each feature, and the effect of various hyperparameters on our models.

Given that the baseline models perform better than the other models, we decide to analyze the flattened feature space and see if it was truly separable in the high-dimensional space. We hypothesize that each class has its own cluster in the high-dimensional space and is far from other classes. One way to confirm this is to use PCA to reduce the feature space to 3 dimensions and plot the data. We see in Figure 3 that examples from the same class (indicated by similar colors) cluster together, which indicates the presence of the clusters in the higher dimensional space.

[Figure 3: Feature space reduced to three dimensions]

Next, we want to see which of the features are most discriminative, so that we could restrict the features we were using in the more complex models to save on runtime. We run two rounds of ablation tests on the SVM to determine which features were the most significant contributors to the overall performance. The first round removes each feature independently and measures the results on the data without that one feature, while the second round removes an increasing number of features. All of these tests are run on the low quality dataset. From the results of these tests, we see that removing the position features of the hand results in a significant increase in test error (see Table 2). Additionally, the largest jump in error for the second set of ablation tests occurs when we remove the position and rotation features. Removing the finger features does not have a significant impact on the performance of the model, suggesting that the position and rotation features are the most distinguishing features between the signs.

Table 2: Ablation test results

  Removed features        Test error
  None                    0.450
  POS                     0.786
  ROT                     0.519
  F1                      0.491
  F2                      0.472
  F3                      0.489
  F4                      0.458
  POS, ROT                0.881
  POS, ROT, F1            0.896
  POS, ROT, F1, F2        0.898
  POS, ROT, F1, F2, F3    0.929

  Here, POS and ROT refer to the position and orientation of the right
  wrist respectively. F# refers to the fingers on the hand. The ordering
  of the fingers is thumb, index, middle, ring.

Furthermore, we perform extensive hyperparameter tuning for our SPM model. Since our hyperparameter space is quite large, we use a procedure akin to coordinate ascent to find the optimal set of hyperparameters. For the larger models, we also start with a small number of signs to build an intuition on the effect of varying each hyperparameter, and then slowly expand our training/test datasets to include more signs. As shown in Figure 4, the test error increases as we include more signs in our analysis. We started with 5 signs and progressively added sets of signs, and for each set we computed the hyperparameters that gave us the best accuracy. As we can see, the window size reduces as we increase the number of signs. One reason for this may be that as we increase the number of signs, the probability of seeing a pattern again increases if the window size is held constant. Since we want to choose the patterns that are most discriminative, a lower window size leads to better performance with a high number of examples. We also notice that the optimal window size and maximum pattern lengths are quite small (< 10) for all sets of signs we tried, indicating that the instantaneous motions in each sign serve as discriminative features, rather than longer patterns.

[Figure 4: Performance of SPM with varying dataset sizes]

As our final analysis, we also try concatenating the feature vectors we get from the SPM algorithm with the raw flattened features. Using a small subset of these features (~50), we found that we get a 1% bump in accuracy over our baseline SVM.

6 Conclusions and Future Work

In this paper, we study and apply machine learning techniques for temporal classification, specifically the multivariate case. Although the results obtained are not very high, we believe that a more efficient implementation of the algorithms can yield bigger and more complex models that will perform well.

In the future, we plan to improve the implementation of the algorithm behind SPM to build better models. We may also consider implementing a custom LSTM model that removes the assumptions of the technique that do not apply to our data. Finally, we would also like to use a device available in the market today, namely the Myo armband, to record our data, and try our models on this data. Even though the data that will be collected will not be exactly the same as our current data, we believe that the techniques we have tried and implemented are general enough for them to work well on the new data.

Most importantly, we have shown that at least with high quality data, it is indeed possible for us to translate sign language into text. We hope that some day this will enable hearing-impaired people to communicate more effortlessly with the rest of society.

7 References

[1] Iyad Batal et al. "Multivariate Time Series Classification with Temporal Abstractions". In: FLAIRS Conference. 2009.

[2] Francois Chollet. Keras. https://github.com/fchollet/keras. 2015.

[3] Alex Graves et al. "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks". In: Proceedings of the 23rd International Conference on Machine Learning. ACM. 2006, pp. 369-376.

[4] Mohammed Waleed Kadous. "Temporal classification: Extending the classification paradigm to multivariate time series". PhD thesis.
The University of New South Wales, 2002.

[5] Rung-Huei Liang and Ming Ouhyoung. "A real-time continuous gesture recognition system for sign language". In: Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition. IEEE. 1998, pp. 558-567.

[6] Syed Atif Mehdi and Yasir Niaz Khan. "Sign language recognition using sensor gloves". In: Proceedings of the 9th International Conference on Neural Information Processing (ICONIP '02). Vol. 5. IEEE. 2002, pp. 2204-2206.

[7] Eng-Jon Ong et al. "Sign language recognition using sequential pattern trees". In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. 2012, pp. 2200-2207.

[8] Panagiotis Papapetrou. "Constraint-Based Mining of Frequent Arrangements of Temporal Intervals". PhD thesis. Boston University, 2007.

[9] F. Pedregosa et al. "Scikit-learn: Machine Learning in Python". In: Journal of Machine Learning Research 12 (2011), pp. 2825-2830.

[10] Thad Starner, Joshua Weaver, and Alex Pentland. "Real-time American sign language recognition using desk and wearable computer based video". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 20.12 (1998), pp. 1371-1375.
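As an illustrative addendum, the discretization step of the SPM pipeline (step 1 in Section 4.3) can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function name is hypothetical, and the evenly spaced thresholds are an assumption, since the paper sets thresholds per level set but does not report their values.

```python
import numpy as np

def discretize(signal, levels=("L", "M", "H")):
    """Map a spatially scaled signal (values in [0, 1]) to discrete
    symbols, then combine all consecutive values that are equal."""
    # Evenly spaced thresholds over [0, 1] -- an assumption; the paper
    # does not report the threshold values it uses.
    thresholds = np.linspace(0.0, 1.0, len(levels) + 1)[1:-1]
    symbols = [levels[i] for i in np.digitize(signal, thresholds)]
    # Collapse runs of equal consecutive symbols, as in the paper's
    # example HMLLHHLMMMMLH -> HMLHLMLH.
    collapsed = [symbols[0]]
    for s in symbols[1:]:
        if s != collapsed[-1]:
            collapsed.append(s)
    return "".join(collapsed)

# A signal whose symbols read HMLLHHLMMMMLH collapses as in the paper.
signal = np.array([0.9, 0.5, 0.1, 0.1, 0.9, 0.9, 0.1,
                   0.5, 0.5, 0.5, 0.5, 0.1, 0.9])
print(discretize(signal))  # -> HMLHLMLH
```

The collapsed strings are what the candidate-pattern-generation step would consume; a five-level variant follows by passing `levels=("VL", "L", "M", "H", "VH")`.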
