ebook img

New Trends in Classification and Data Mining PDF

203 Pages·4.235 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview New Trends in Classification and Data Mining

Krassimir Markov, Vladimir Ryazanov, Vitalii Velychko, Levon Aslanyan (editors) New Trends in Classification and Data Mining I T H E A SOFIA 2010 Krassimir Markov, Vladimir Ryazanov, Vitalii Velychko, Levon Aslanyan (ed.) New Trends in Classification and Data Mining ITHEA® Sofia, Bulgaria, 2010 First edition Recommended for publication by The Scientific Concil of the Institute of Information Theories and Applications FOI ITHEA This book maintains articles on actual problems of classification, data mining and forecasting as well as natural language processing: - new approaches, models, algorithms and methods for classification, forecasting and clusterisation. Classification of non complete and noise data; - discrete optimization in logic recognition algorithms construction, complexity, asymptotically optimal algorithms, mixed-integer problem of minimization of empirical risk, multi-objective linear integer programming problems; - questions of complexity of some discrete optimization tasks and corresponding tasks of data analysis and pattern recognition; - the algebraic approach for pattern recognition - problems of correct classification algorithms construction, logical correctors and resolvability of challenges of classification, construction of optimum algebraic correctors over sets of algorithms of computation of estimations, conditions of correct algorithms existence; - regressions, restoring of dependences according to training sampling, parametrical approach for piecewise linear dependences restoration, and nonparametric regressions based on collective solution on set of tasks of recognition; - multi-agent systems in knowledge discovery, collective evolutionary systems, advantages and disadvantages of synthetic data mining methods, intelligent search agent model realizing information extraction on ontological model of data mining methods; - methods of search of logic regularities sets of classes and extraction of optimal subsets, construction of convex combination of associated predictors that minimizes mean error; - algorithmic constructions in a model of recognizing the nearest neighbors in binary data sets, discrete isoperimetry problem solutions, logic-combinatorial scheme in high-throughput gene expression data; - researches in area of neural network classifiers, and applications in finance field; - text mining, automatic classification of scientific papers, information extraction from natural language texts, semantic text analysis, natural language processing. It is represented that book articles will be interesting as experts in the field of classifying, data mining and forecasting, and to practical users from medicine, sociology, economy, chemistry, biology, and other areas. General Sponsor: Consortium FOI Bulgaria (www.foibg.com). Printed in Bulgaria Copyright © 2010 All rights reserved © 2010 ITHEA® – Publisher; Sofia, 1000, P.O.B. 775, Bulgaria. www.ithea.org ; e-mail: [email protected] © 2010 Krassimir Markov, Vladimir Ryazanov, Vitalii Velychko, Levon Aslanyan – Editors © 2010 Ina Markova – Technical editor © 2010 For all authors in the book. ® ITHEA is a registered trade mark of FOI-COMMERCE Co. ISBN 978-954-16-0042-9 C\o Jusautor, Sofia, 2010 New Trends in Classification and Data Mining 3 PREFACE ITHEA International Scientific Society (ITHEA ISS) is aimed to support growing collaboration between scientists from all over the world. The scope of the books of the ITHEA ISS covers the area of Informatics and Computer Science. ITHEA ISS welcomes scientific papers and books connected with any information theory or its application. ITHEA ISS rules for preparing the manuscripts are compulsory. ITHEA Publishing House is the official publisher of the works of the members of the ITHEA ISS. Responsibility for papers and books published by ITHEA belongs to authors. This book maintains articles on actual problems of classification, data mining and forecasting as well as natural language processing: - new approaches, models, algorithms and methods for classification, forecasting and clusterisation. Classification of non complete and noise data; - discrete optimization in logic recognition algorithms construction, complexity, asymptotically optimal algorithms, mixed-integer problem of minimization of empirical risk, multi-objective linear integer programming problems; - questions of complexity of some discrete optimization tasks and corresponding tasks of data analysis and pattern recognition; - the algebraic approach for pattern recognition - problems of correct classification algorithms construction, logical correctors and resolvability of challenges of classification, construction of optimum algebraic correctors over sets of algorithms of computation of estimations, conditions of correct algorithms existence; - regressions, restoring of dependences according to training sampling, parametrical approach for piecewise linear dependences restoration, and nonparametric regressions based on collective solution on set of tasks of recognition; - multi-agent systems in knowledge discovery, collective evolutionary systems, advantages and disadvantages of synthetic data mining methods, intelligent search agent model realizing information extraction on ontological model of data mining methods; - methods of search of logic regularities sets of classes and extraction of optimal subsets, construction of convex combination of associated predictors that minimizes mean error; - algorithmic constructions in a model of recognizing the nearest neighbors in binary data sets, discrete isoperimetry problem solutions, logic-combinatorial scheme in high-throughput gene expression data; - researches in area of neural network classifiers, and applications in finance field; - text mining, automatic classification of scientific papers, information extraction from natural language texts, semantic text analysis, natural language processing. It is represented that book articles will be interesting as experts in the field of classifying, data mining and forecasting, and to practical users from medicine, sociology, economy, chemistry, biology, and other areas. 4 I T H E A The book is recommended for publication by the Scientific Concil of the Institute of Information Theories and Applications FOI ITHEA. Papers in this book are selected from the ITHEA ISS Joint International Events of Informatics "ITA 2010”: CFDM Second International Conference "Classification, Forecasting, Data Mining" i-i-i International Conference "Information - Interaction – Intellect" i.Tech Eight International Conference "Information Research and Applications" ISK V-th International Conference “Informatics in the Scientific Knowledge” MeL V-th International Conference "Modern (e-) Learning" KDS XVI-th International Conference "Knowledge - Dialogue – Solution" CML XII-th International Conference “Cognitive Modeling in Linguistics” INFOS Thirth International Conference "Intelligent Information and Engineering Systems" NIT International Conference “Natural Information Technologies” GIT Eight International Workshop on General Information Theory ISSI Forth International Summer School on Informatics ITA 2010 took place in Bulgaria, Croatia, Poland, Spain and Ukraine. It has been organized by ITHEA International Scientific Society in collaboration with:  ITHEA International Journal "Information Theories and Applications"  ITHEA International Journal "Information Technologies and Knowledge"  Institute of Information Theories and Applications FOI ITHEA  Dorodnicyn Computing Centre of the Russian Academy of Sciences (Russia)  Universidad Politecnica de Madrid (Spain)  Institute of Linguistics, Russian Academy of Sciences (Russia)  Association of Developers and Users of Intelligent Systems (Ukraine)  V.M.Glushkov Institute of Cybernetics of National Academy of Sciences of Ukraine  Institute of Mathematics and Informatics, BAS (Bulgaria)  Institute of Mathematics of SD RAN (Russia)  Taras Shevchenko National University of Kiev (Ukraine)  New Bulgarian University (Bulgaria)  The University of Zadar (Croatia)  Rzeszow University of Technology (Poland)  BenGurion University (Israel)  University of Calgary (Canada)  University of Hasselt (Belgium)  Kharkiv National University of Radio Electronics (Ukraine)  Kazan State University (Russia)  Alexandru Ioan Cuza University (Romania)  Moscow State Linguistic University (Russia)  Astrakhan State University (Russia) as well as many other scientific organizations. For more information: www.ithea.org . We express our thanks to all authors, editors and collaborators as well as to the General Sponsor. The great success of ITHEA International Journals, International Books and International Conferences belongs to the whole of the ITHEA International Scientific Society. Sofia – Moscow – Kiev - Erevan, June 2010 K. Markov, V. Ryazanov, V. Velychko, L. Aslanyan New Trends in Classification and Data Mining 5 TABLE OF CONTENT Preface ..................................................................................................................................... 3  Table of Content....................................................................................................................... 5  Index of Authors ....................................................................................................................... 7  Minimization of Empirical Risk in Linear Classifier Problem  Yurii I. Zhuravlev, Yury Laptin, Alexander Vinogradov ............................................................. 9  Restoring  of  Dependences  on  Samplings  of  Precedents  with  Usage  of  Models  of  Recognition  V.V.Ryazanov, Ju.I.Tkachev ...................................................................................................... 17  Composite Block Optimized Classification Data Structures  Levon Aslanyan, Hasmik Sahakyan .......................................................................................... 25  Synthesis of Corrector Family with High Recognition Ability  Elena Djukova, Yurii Zhuravlev, Roman Sotnezov .................................................................... 32  Methods for Evaluating of Regularities Systems Structure  Kostomarova Irina,  Kuznetsova Anna, Malygina Natalia, Senko Oleg .................................... 40  Growing Support Set Systems in Analysis of High‐Throughput Gene Expression Data  Arsen Arakelyan, Anna Boyajian, Hasmik Sahakyan,  Levon Aslanyan, Krassimira Ivanova, Iliya Mitov ..................................................................... 47  On the Complexity of Search for Conjunctive Rules in Recognition Problems  Elena Djukova, Vladimir Nefedov ............................................................................................. 54  Optimal Forecasting Based on Convexcorrecting Procedures  Oleg Senko, Alexander Dokukin ............................................................................................... 62  Reference‐Neighbourhood Scalarization for Multiobjective Integer Linear Programming  Problems  Krassimira Genova, Mariana Vassileva .................................................................................... 73  Multiagent Applications in Security Systems: New Proposals and Perspectives  Vladimir Jotsov ......................................................................................................................... 82 6 I T H E A Numeric‐Lingual Distinguishing Features of Scientific Documents  Vladimir Lovitskii, Ina Markova, Krassimir Markov, Ilia Mitov ................................................. 94  Data and Metadata Exchange Repository Using Agents Implementation  Tetyana Shatovska, Iryna Kamenieva ...................................................................................... 102  LSPL‐Patterns as a Tool for Information Extraction From Natural Language Texts  Elena Bolshakova, Natalia Efremova, Alexey Noskov .............................................................. 110  Computer Support of Semantic Text Analysis of a Technical Specification on Designing  Software .........................................................................................................................   Alla V. Zaboleeva‐Zotova, Yulia A. Orlova ................................................................................ 119  Linguistics Research and Analysis of the Bulgarian Folklore. Experimental Implementation  of Linguistic Components in Bulgarian Folklore Digital Library  Konstantin Rangochev, Maxim Goynov,  Desislava Paneva‐Marinova, Detelin Luchev .......... 131  Natural Interface to Election Data  Elena Long, Vladimir Lovitskii, Michael Thrasher ..................................................................... 138  Analysis of Natural Language Objects  Oleksii Vasylenko ..................................................................................................................... 147  Базовые  структуры  евклидовых  пространств:  конструктивные  методы  описания  и  использования  Владимир Донченко, Юрий Кривонос, Виктория Омардибирова ..................................... 155  Нейросетевая архитектура на частичных обучениях  Николай Мурга ....................................................................................................................... 170  Разработка  автоматизированной  процедуры  совмещения  изображений  произведений живописи в видимом И рентгеновском спектральных диапазонах  Дмитрий Мурашов ................................................................................................................. 185  Распознавание обектов с неполной информацией и искаженных преобразованиями  из заданной группы в рамках логико‐предметной распознающей системы  Татьяна Косовская .................................................................................................................. 196 New Trends in Classification and Data Mining 7 INDEX OF AUTHORS Alexander Dokukin 62 Mariana Vassileva 73 Alexander Vinogradov 9 Maxim Goynov 131 Alexey Noskov 110 Michael Thrasher 138 Alla Zaboleeva-Zotova 119 Natalia Efremova 110 Anna Boyajian 47 Natalia Malygina 40 Anna Kuznetsova 40 Oleg Senko 40, 62 Arsen Arakelyan 47 Roman Sotnezov 32 Desislava Paneva-Marinova 131 Tetyana Shatovska 102 Detelin Luchev 131 V.V.Ryazanov 17 Elena Bolshakova 110 Vasylenko Oleksii 147 Elena Djukova 32, 54 Vladimir Jotsov 82 Elena Long 138 Vladimir Lovitskii 94, 138 Hasmik Sahakyan 25, 47 Vladimir Nefedov 54 Iliya Mitov 47, 94 Yulia Orlova 119 Ina Markova 94 Yurii Zhuravlev 9, 32 Irina Kostomarova 40 Yury Laptin 9 Iryna Kamenieva 102 Виктория Омардибирова 155 Ju.I.Tkachev 17 Владимир Донченко 155 Konstantin Rangochev 131 Дмитрий Мурашов 185 Krassimir Markov 94 Мурга Николай 170 Krassimira Genova 73 Татьяна Косовская 196 Krassimira Ivanova 47 Юрий Кривонос 155 Levon Aslanyan 25, 47 8 I T H E A New Trends in Classification and Data Mining 9 MINIMIZATION OF EMPIRICAL RISK IN LINEAR CLASSIFIER PROBLEM Yurii I. Zhuravlev, Yury Laptin, Alexander Vinogradov Abstract: Mixed-integer formulation of the problem of minimization of empirical risk is considered. Some possibilities of decision of the continuous relaxation of this problem are analyzed. Comparison of the proposed continuous relaxation with a similar SVM problem is performed too. Keywords: cluster, decision rule, discriminant function, linear and non-linear programming, non-smooth optimization ACM Classification Keywords: G.1.6 Optimization - Gradient methods, I.5 Pattern Recognition; I.5.2 Design Methodology - Classifier design and evaluation Acknowledgement: This work was done in the framework of Joint project of the National Academy of Sciences of Ukraine and the Russian Foundation for Fundamental Research No 08-01-90427 'Methods of automatic intellectual data analysis in tasks of recognition objects with complex relations'. Introduction Recently considerable number of researches are devoted to problems of construction of linear algorithms of classification (classifiers). In many cases such problems are considered for classification of two sets. Usually linear classifier problems are formulated for the case of linearly separable sets. In separable case the mentioned problems can be efficiently solved [1–4]. The concept of optimality for two linearly separable sets has a simple geometrical sense – the optimum classifier defines the strip of maximal width separating these sets. For linear separability of two finit sets it is necessary and sufficient for convex envelops of these sets don’t intersect each other. But this condition is not sufficient in the case of more than two sets. In [5–7] some sufficient conditions of linear separability of any number of final sets are formulated. Minimization of the empirical risk is the natural criterion of choice of the classifier in case of linearly inseparable sets. In this paper, a mixed-integer formulation of the problem of minimization of empirical risk is considered, and some possibilities of decision of the continuous relaxation of this problem are analyzed. Comparison of the proposed continuous relaxation with a similar SVM problem is performed too. 1. Problem formulation Let a set of linear functions is defined f (x,Wi)(wi,x)wi, i1,..., m, where xRn is attribute i 0 i i i n1 vector, and W (w ,w )R , i 1,...,m, are vectors of parameters. We denote 0 1 m L W (W ,...,W ), W R , L m(n1). Let's consider linear algorithms of classification (linear classifiers) of the following kind  i  n L a(x,W)argmax f (x,W ): i 1,..., m ;xR ;W R i (1) i 10 I T H E A In [6] also classifiers, in which f are convex piece-wise linear functions, were investigated. i Here it is considered a family of finite not intersected sets  , i1,..., m. We will say that the classifier i a(x,W) separates correctly points from  , i1,..., m, if a(x,W)i for all x , i1,..., m. i i Sets  , i1,..., m are called linearly separable if there is a linear classifier correctly separating points from i these sets. Each set i, i1,..., m is a training sample of points from some class i known only on these sample units. The training process for classifier a(x,W) consists in selection of parameters W at which classes i, i1,..., m are separated in the best way (in some sense). For definition of the quality of separation various approaches are used. m Let  , points of the set  are enumerated, T is the set of indices,  xt : tT, T is a i i i1 m subset of indices correponding to points from  ,  xt : tT, T T . Let function i(t) returns i i i i i1 t the index of the set  , to which the point x belongs, tT . The value i gt(W)minf (xt,Wi) f (xt,W j): j1,..., m\i, i i(t) i j min(wi wj,xt)wi wj : j1,..., m\i, i i(t) (2) 0 0 t is called as a margin or a gap of the classifier a(x,W) on the pointx , tT . The classifier a(x,W) makes a mistake in a pointxt iff the gap gt(W) is negative. The valueg(W)mingt(W):tT is called as a gap of the classifier a(x,W) on the family of sets  , i1,..., m. i The classifier a(x,W) correctly separates points from ,i1,...,m , ifg(W)0 . i i Remark 1. The classifier a(x,W) is invariant with respect to multiplication of all functions f (vectorsW ) by i positive number, and the gap g(W) is linear with respect to such multiplication. The classifiera(x,W) and the gap g(W) are invariant concerning to addition of any real number to all f . i The valueg(W) can be used as a criterion of quality of the classifier a(x,W) (the more valueg(W), the more reliably points from  ,i1,...,m are separated). However, it is necessary to take into account a norm for i the family of vectors W which we denote (W) and name norm of the classifier a(x,W). Let's use the following function: m n (W) (wi )2 (3) k i1k1

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.