Deep Neural Networks for Czech Multi-label Document Classification

Ladislav Lenc¹,² and Pavel Král¹,²

¹ Dept. of Computer Science & Engineering, Faculty of Applied Sciences, University of West Bohemia, Plzeň, Czech Republic
² NTIS - New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Plzeň, Czech Republic
{llenc,pkral}@kiv.zcu.cz

arXiv:1701.03849v2 [cs.CL] 18 Jan 2017

Abstract. This paper focuses on automatic multi-label classification of Czech text documents. Current approaches usually use some pre-processing, which can have a negative impact (loss of information, additional implementation work, etc.). We would therefore like to omit it and use deep neural networks that learn from simple features. This choice was motivated by their successful use in many other machine learning fields. Two different networks are compared: the first is a standard multi-layer perceptron, while the second is a popular convolutional network. Experiments on a Czech newspaper corpus show that both networks significantly outperform a baseline method that uses a rich set of features with a maximum entropy classifier. We have also shown that the convolutional network gives the best results.

Keywords: Czech, Deep Neural Networks, Document Classification, Multi-label

1 Introduction

The amount of electronic text documents is growing extremely rapidly, and automatic document classification (or categorization) is therefore becoming very important for information organization, storage and retrieval. Multi-label classification is considerably more important than single-label classification because it usually corresponds better to the needs of current applications.

Modern approaches usually use several pre-processing tasks: feature selection/reduction [1]; precise document representation (e.g. POS filtering, particular lexical and syntactic features, lemmatization, etc.)
[2] to reduce the feature space with minimal negative impact on classification accuracy. However, this pre-processing has several drawbacks, for instance loss of information, significant additional implementation work, dependency on the task/application, etc.

Neural networks with deep learning are very popular in the machine learning field today, and it has been proved that they outperform many state-of-the-art approaches without any parametrization. This fact is particularly evident in image processing [3]; however, it was further shown that they are also superior in Natural Language Processing (NLP), including Part-Of-Speech (POS) tagging, chunking, named entity recognition and semantic role labelling [4]. However, to the best of our knowledge, the current published work does not include their application to multi-label document classification.

Therefore, the main goal of this paper consists in using neural networks for multi-label document classification of Czech text documents. We will compare several topologies with different numbers of parameters to show that they can have better accuracy than the state-of-the-art methods.

We use and compare standard feed-forward networks (i.e. multi-layer perceptrons) and popular Convolutional Neural Networks (CNNs). To the best of our knowledge, this comparison has never been done on this task before; it is therefore another contribution of this paper. Note that we expect better performance of the CNNs, as shown for instance in the OCR task [5].

The results of this work should be integrated into an experimental multi-label document classification system. The system should replace the manual annotation of newspaper documents, which is a very expensive and time-consuming task, and thus save human resources in the Czech News Agency (ČTK)¹.

The rest of the paper is organized as follows. Section 2 is a short review of document classification methods with a particular focus on neural networks. Section 3 describes our document classification approaches. Section 4 deals with experiments realized on the ČTK corpus and then discusses the obtained results. In the last section, we conclude the experimental results and propose some future research directions.
2 Related Work

Document classification is usually based on supervised machine learning methods that exploit an annotated corpus to train a classifier, which then assigns classes to unlabelled documents. Most works use the Vector Space Model (VSM), which usually represents each document with a vector of all word occurrences weighted by their Term Frequency-Inverse Document Frequency (TF-IDF).

Several classification methods have been successfully used [6], for instance Bayesian classifiers, Maximum Entropy (ME), Support Vector Machines (SVMs), etc. However, the main issue of this task is that the feature space in the VSM is highly dimensional, which decreases the accuracy of the classifier.

Numerous feature selection/reduction approaches have been introduced [1,7] to solve this problem. Furthermore, a better document representation should help to decrease the feature vector dimension, e.g. using lexical and syntactic features as shown in [2]. Chandrasekar et al. further show in [8] that it is beneficial to use POS-tag filtration in order to represent a document more accurately.

More recently, some interesting approaches based on Latent Dirichlet Allocation (L-LDA) [9] have been introduced. Another method exploits partial labels to discover latent topics [10]. Principal Component Analysis [11] incorporating semantic concepts [12] has also been used for document classification.

¹ http://www.ctk.eu

Recently, "deep" Neural Nets (NNs) have shown their superior performance in many natural language processing tasks, including POS tagging, chunking, named entity recognition and semantic role labelling [4], without any parametrization. Several different topologies and learning algorithms have been proposed.

For instance, the authors of [13] propose two Convolutional Neural Nets (CNNs) for ontology classification, sentiment analysis and single-label document classification.
Their networks are composed of 9 layers, out of which 6 are convolutional layers and 3 fully-connected layers with different numbers of hidden units and frame sizes. They show that the proposed method significantly outperforms the baseline approaches (bag of words) on English and Chinese corpora. Another interesting work [14] uses pre-trained vectors from word2vec [15] in the first layer (i.e. a lookup table). The authors show that the proposed models outperform the state-of-the-art on 4 out of 7 tasks, which include sentiment analysis and question classification.

For additional information about architectures, algorithms and applications of deep learning, please refer to the survey [16].

On the other hand, classical feed-forward neural net architectures, represented particularly by multi-layer perceptrons, are used rather rarely. However, these models were very popular before, and some approaches for document classification exist. Manevitz et al. show in [17] that their simple feed-forward neural network with three layers (20 inputs, 6 neurons in the hidden layer and 10 neurons in the output layer, i.e. the number of classes) gives an F-measure of about 78% on the standard Reuters dataset.

Traditional multi-layer neural networks were also used for multi-label document classification in [18]. The authors have modified the standard backpropagation algorithm for multi-label learning, which employs a novel error function. This approach is evaluated on functional genomics and text categorization.

Most of the proposed approaches are focused on English, and only a few works deal with the Czech language. Hrala et al. use in [19] lemmatization and Part-Of-Speech (POS) filtering for a precise representation of Czech documents. In [20], three different multi-label classification approaches are compared and evaluated. Another recent work proposes novel features based on unsupervised machine learning [21]. To the best of our knowledge, no document classification approach using neural nets deals with the Czech language.
3 Neural Nets for Multi-label Document Classification

3.1 Baseline Classification

The feature set is created according to Brychcín et al. [21] and is composed of words, stems and features created by S-LDA and COALS. They are used because the authors experimentally proved that the additional unsupervised features significantly improve classification results.

For multi-label classification, we use an efficient approach presented by Tsoumakas et al. in [22]. This method employs n binary classifiers C_i : d → {l, ¬l}, i = 1, ..., n (i.e. each binary classifier assigns the document d to the label l iff the label is included in the document, ¬l otherwise). The classification result is given by the following equation:

    C(d) = ∪_{i=1}^{n} C_i(d)    (1)

The Maximum Entropy (ME) model is used for classification.

3.2 Standard Feed-forward Deep Neural Network (FDNN)

Feed-forward neural networks are probably the most commonly used type of NNs. We propose to use an MLP with two hidden layers, which can be seen as a deep network². As an input of our network we use a simple Bag of Words (BoW), which is a binary vector where the value 1 means that the word with a given index is present in the document. The size of this vector depends on the size of the dictionary, which is limited to the N most frequent words. The only preprocessing is the conversion of all characters to lower case and the replacement of all numbers by one common token.

The size of the input layer thus depends on the size of the dictionary that is used for the feature vector creation. The first hidden layer has 1024 nodes, while the second one has 512³. The output layer has a size equal to the number of categories, which is 37 in our case. To handle the multi-label classification, we threshold the values of the nodes in the output layer. Only the labels whose values are larger than a given threshold are assigned.

3.3 Convolutional Neural Network (CNN)

The input feature of the CNN is the sequence of words in the document. We use similar document preprocessing and also a similar dictionary as in the previous approach. The words are then represented by their indexes into the dictionary.
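The binary-relevance scheme of Eq. (1) above can be sketched as follows. This is a minimal illustration only: the keyword-trigger classifiers and label names are hypothetical stand-ins for the trained maximum-entropy models of Section 3.1.

```python
# A minimal sketch of Eq. (1): one binary classifier per label; the
# predicted label set is the union of the labels whose classifier fires.

def binary_relevance_predict(document, classifiers):
    """classifiers maps a label l to a function d -> bool
    (True iff the binary classifier assigns label l to document d)."""
    return {label for label, clf in classifiers.items() if clf(document)}

# Hypothetical keyword triggers standing in for trained ME models.
classifiers = {
    "sport":   lambda d: "match" in d,
    "economy": lambda d: "market" in d,
    "culture": lambda d: "film" in d,
}

predicted = binary_relevance_predict("the market reacted to the match",
                                     classifiers)
# predicted == {"economy", "sport"}
```

Any per-label model can be plugged in; the union in Eq. (1) is what makes the scheme multi-label.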
The first important issue of this network for document classification is the variable length of documents. It is usually solved by setting a fixed value: longer documents are shortened, while shorter ones must be padded to ensure exactly the same length. Words that are not in the dictionary are assigned a reserved index, and the padding also has a reserved index.

The architecture of our network is motivated by Kim [14]. However, we use just one size of convolutional kernel and not a combination of several sizes. Our kernels have only one dimension (1D), while Kim has used larger two-dimensional kernels. This is mainly due to our preliminary experiments, where the simple one-dimensional kernels gave better results than the larger ones.

The input of our network is a vector of word indexes of length L, where L is the number of words used for document representation. The second layer is an embedding layer, which represents each input word as a vector of a given length. The document is thus represented as a matrix with L rows and EMB columns, where EMB is the length of the embedding vectors. The third layer is the convolutional one. We use N_C convolution kernels of size K × 1, which means we do a 1D convolution over one position in the embedding vector over K input words. The following layer performs max pooling over the length L − K + 1, resulting in N_C vectors of size 1 × EMB. The output of this layer is then flattened and connected with the output layer containing 37 nodes.

² We have also experimented with an MLP with one hidden layer, with lower accuracy.
³ This configuration was set experimentally.

The output of the network is then thresholded to get the final results. Values greater than a given threshold indicate the labels that are assigned to the classified document. The architecture of the network is depicted in Figure 1.

Fig. 1. Architecture of the convolutional network

4 Experiments

In this section we first describe the Czech document corpus that we used for the evaluation of our methods. After that we describe the performed experiments and the final results. The results are compared with previously published results on the Czech document corpus.
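Before turning to the experiments, the layer dimensions of the CNN of Section 3.3 can be checked with a small NumPy sketch. Toy sizes are used here (the paper's operating point is L = 400, EMB = 200, N_C = 40, K = 16); the random data is illustrative only.

```python
import numpy as np

# Shape walk-through of the CNN of Section 3.3 with toy sizes: a document
# of L word indexes is embedded to an L x EMB matrix, convolved with N_C
# one-dimensional kernels of size K x 1 (K consecutive words, one
# embedding position at a time), then max-pooled over the L-K+1 positions.
L, EMB, N_C, K = 10, 4, 3, 3
rng = np.random.default_rng(0)

embedded = rng.random((L, EMB))       # output of the embedding layer
kernels = rng.random((N_C, K))        # N_C kernels of size K x 1

conv = np.empty((N_C, L - K + 1, EMB))
for n in range(N_C):
    for t in range(L - K + 1):
        # kernel spans K words at each embedding position independently
        conv[n, t] = kernels[n] @ embedded[t:t + K]

pooled = conv.max(axis=1)             # max pooling over length L-K+1
flat = pooled.reshape(-1)             # flattened, feeds the 37-node output
print(pooled.shape)                   # (3, 4): N_C vectors of size 1 x EMB
print(flat.shape)                     # (12,)
```

The pooled shape (N_C, EMB) matches the "N_C vectors of size 1 × EMB" stated in Section 3.3.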
4.1 Tools and Corpus

For the implementation of all neural nets we used the Keras tool-kit [23], which is based on the Theano deep learning library [24]. It was chosen mainly because of its good performance and our previous experience with this tool. All experiments were computed on a GPU to achieve reasonable computation times.

As already stated, the results of this work shall be used by the ČTK. Therefore, for the following experiments we used the Czech text documents provided by the ČTK. This corpus contains 2,974,040 words belonging to 11,955 documents. The documents are annotated from a set of 60 categories, out of which we used the 37 most frequent ones. The category reduction was done to allow comparison with previously reported results on this corpus, where the same set of 37 categories was used. Figure 2 illustrates the distribution of the documents depending on the number of labels. Figure 3 shows the distribution of the document lengths (in word tokens). This corpus is freely available for research purposes at http://home.zcu.cz/~pkral/sw/.

Fig. 2. Distribution of documents depending on the number of labels

Fig. 3. Distribution of the document lengths

We use the five-fold cross-validation procedure for all following experiments, where 20% of the corpus is reserved for testing and the remaining part for training of our models. For the evaluation of the document classification accuracy, we use the standard F-measure (F1) metric [25]. The confidence interval of the experimental results is ±0.6% at a confidence level of 0.95 [26].

4.2 Experimental Results

FDNN As a first experiment, we would like to validate the proposition of thresholding applied to the output layer of the FDNN.
For this task we use the Receiver Operating Characteristic (ROC) curve, which clearly shows the relationship between the true positive and the false positive rate for different values of the acceptance threshold. We use the 20,000 most common words to create the dictionary. The ROC curve is depicted in Figure 4. According to the shape of this curve we can conclude that the proposed approach is suitable for multi-label document classification.

Fig. 4. ROC curve of the FDNN

In the second experiment we would like to identify the optimal activation function of the nodes in the output layer. Two functions (sigmoid and softmax) are compared and evaluated. We have evaluated threshold values in the interval [0; 1]; however, only the best classification scores are depicted (see Table 1, best threshold values in brackets). This table shows that the softmax gives better results. Based on these results, we will further use this activation function, and the threshold is set to 0.11.

Table 1. Comparison of output layer activation functions of the FDNN (threshold values depicted in brackets)

Activation function   F1 [%]
softmax               83.8 (0.11)
sigmoid               82.3 (0.48)

The third experiment studies the influence of the dictionary size on the performance of the FDNN. Table 2 shows the dependency of the F-measure on the number of words in the dictionary. This table shows that the previously chosen 20,000 words is a reasonable choice and further increasing the number does not bring any significant improvement.
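The output-layer thresholding evaluated above can be sketched as follows. The activation vector is made up for illustration; only the threshold 0.11 comes from the paper (the best value found for the softmax FDNN, Table 1).

```python
import numpy as np

# Sketch of the output-layer thresholding used for both networks: every
# label whose output activation exceeds the threshold is assigned to the
# document. The activations below are illustrative, not real model output.

def assign_labels(outputs, threshold):
    return [i for i, v in enumerate(outputs) if v > threshold]

outputs = np.array([0.02, 0.45, 0.09, 0.30, 0.14])  # one value per label
predicted = assign_labels(outputs, 0.11)
# predicted == [1, 3, 4]
```

With a 37-node output layer, `outputs` would have 37 entries and the returned indexes would be the assigned category ids.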
Table 2. F-measure of FDNN with different numbers of words in the dictionary

Word number   1,000  2,000  5,000  10,000  15,000  20,000  25,000  30,000
F1 [%]        72.1   76.7   80.9   82.9    83.5    83.8    83.9    83.9

CNN In all experiments performed with the CNN we use the same dictionary size (20,000 words) as in the case of the FDNN to allow a straightforward comparison of the results. According to the analysis of our corpus, we estimate that a suitable vector size for document representation is 400 words. As for the FDNN, we first compute the ROC curve to validate the proposition of thresholding in the output. Figure 5 clearly shows that this approach is suitable for our task.

Fig. 5. ROC curve of the CNN

As a second experiment we identify an optimal activation function of the neurons in the output layer. We compare the softmax and sigmoid functions. The achieved F-measures are depicted in Table 3. It is clearly visible that in this case the sigmoid function performs better. We will thus use the sigmoid activation function, and the threshold will be set to 0.1 for all further experiments with the CNN.

Table 3. Comparison of output layer activation functions of the CNN (threshold values depicted in brackets)

Activation function   F1 [%]
softmax               80.9 (0.07)
sigmoid               84.0 (0.10)

In this experiment, we show the impact of the number of convolutional kernels in our network on the classification score. 400 words are used for document representation (L = 400) and the embedding vector size is 200. This experiment shows (see Table 4) that this parameter influences the classification score only very slightly (ΔF1 ≈ +1%). All values from the interval [20; 64] are suitable for our goal, and we therefore chose the value of 40 for further experimentation.

Table 4. F-measure of CNN with different numbers of convolutional kernels
Kernel no.   12    16    20    24    28    32    36    40    44    48    52    56    60    64
F1 [%]       83.1  83.4  84.0  83.9  84.1  84.1  84.1  84.1  84.1  84.2  84.2  84.1  84.1  84.0

The following experiment shows the dependency of the F-measure on the size of the convolutional kernels. We use 40 kernels, and the size of the kernel varies from 2 to 40. The size of the kernels can be interpreted as the length of the word sequences that the CNN works with. Figure 6 shows that the results are comparable, and as a good compromise we chose the size of 16 for the following experiments.

Fig. 6. Dependency of F-measure on the size of the convolutional kernel

Finally, we tested our network with different numbers of input words and with varying sizes of the embedding vectors. Table 5 shows the results achieved with several combinations of these parameters. We can conclude that the 400 words that we chose at the beginning was a reasonable choice. However, it is beneficial to use longer embedding vectors. It must be noted that further increasing the embedding size has a strong impact on the computation time and might not be practical for real-world applications.

Summary of the Results Table 6 compares the results of our approaches with another efficient method [21]. The results show that both proposed approaches significantly outperform this baseline approach, which uses several features with an ME classifier.

5 Conclusions and Future Work

In this paper, we have used two different neural nets for multi-label document classification of Czech text documents.
Several experiments were realized to set optimal network topologies and parameters. An important contribution is the evaluation of the performance of neural networks using simple features. We have therefore used the BoW representation for the FDNN and a sequence of word indexes for the CNN as the inputs.

Table 5. F-measure of CNN with different word numbers and different embedding sizes [%]

Embedding length \ Word number   100    200    300    400    500
100                              82.3   82.7   83.1   83.3   83.4
150                              82.9   83.3   83.4   83.7   83.9
200                              82.9   83.4   83.8   84.0   84.1
250                              82.5   83.8   84.1   84.2   84.4
300                              83.2   83.9   84.3   84.3   84.5
350                              83.4   83.9   84.3   84.6   84.5
400                              83.4   83.9   84.4   84.7   84.7
450                              83.4   83.9   84.5   84.8   84.3

Table 6. Comparison of the results of our approaches with the maximum entropy based method

Method                 Precision  Recall  F1 [%]
Brychcín et al. [21]   89.0       75.6    81.7
FDNN                   83.7       83.6    83.9
CNN                    86.4       82.8    84.7

Based on these experiments we can conclude that:

– the two proposed network topologies, together with thresholding of the output, are efficient for the multi-label classification task;
– the softmax activation function is better for the FDNN, while the sigmoid activation function gives better results for the CNN;
– the CNN outperforms the FDNN only very slightly (ΔF1 ≈ +0.6%);
– most importantly, both neural nets with only basic pre-processing and without any parametrization significantly improve on the baseline maximum entropy method with a rich set of parameters (ΔF1 ≈ +4%).

Based on these results, we want to integrate the CNN into our experimental document classification system.

In this paper, we have used a relatively simple convolutional neural network. Therefore, our first perspective consists in designing a more complicated CNN architecture. According to the literature, we assume that more layers in this network will have a positive impact on the classification score. Our embedding layer was also not initialized with pre-trained semantic vectors (e.g. word2vec or GloVe). Another perspective thus consists in initializing the embedding CNN layer with pre-trained vectors.
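As a sketch of that last perspective, an embedding matrix can be seeded from pre-trained vectors where they exist and left randomly initialized otherwise. The vocabulary, vectors and dimensions below are hypothetical, not taken from the paper's setup.

```python
import numpy as np

# Seeding an embedding matrix from pre-trained vectors (e.g. word2vec or
# GloVe): rows for known words are copied from the pre-trained model,
# remaining rows keep their random initialization. All values here are
# made-up illustrations.
rng = np.random.default_rng(0)
EMB = 4
vocab = ["<pad>", "praha", "vlada", "trh"]          # index = dictionary id
pretrained = {"praha": np.ones(EMB),                # hypothetical vectors
              "trh": np.full(EMB, 0.5)}

embedding_matrix = rng.normal(scale=0.1, size=(len(vocab), EMB))
for idx, word in enumerate(vocab):
    if word in pretrained:
        embedding_matrix[idx] = pretrained[word]

# embedding_matrix can now initialize the weights of the first
# (lookup-table) layer of the CNN instead of a purely random start.
```

Words absent from the pre-trained model (here `<pad>` and `vlada`) keep their random rows and are trained from scratch.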
Acknowledgements

This work has been supported by project LO1506 of the Czech Ministry of Education, Youth and Sports. We would also like to thank the Czech News Agency (ČTK) for support and for providing the data.
