First Study on Data Readiness Level

Hui Guan∗, Thanos Gentimis†, Hamid Krim∗, James Keiser‡

∗Department of Electrical and Computer Engineering, North Carolina State University. Email: {hguan2,ahk}@ncsu.edu
†Department of Mathematics, Florida Polytechnic University. Email: agentimis@flpoly.org
‡Laboratory for Analytic Sciences. Email: [email protected]

Abstract

We introduce the idea of Data Readiness Level (DRL) to measure the relative richness of data to answer specific questions often encountered by data scientists. We first approach the problem in its full generality, explaining its desired mathematical properties and applications, and then propose and study two DRL metrics. Specifically, we define DRL as a function of at least four properties of data: Noisiness, Believability, Relevance, and Coherence. The information-theoretic metrics Cosine Similarity and Document Disparity are proposed as indicators of Relevance and Coherence for a piece of data. The proposed metrics are validated through a text-based experiment using Twitter data.

1 Introduction

Data nowadays are produced at an unprecedented rate: cheap sensors, existing and synthesized datasets, the emerging internet of things, and especially social media make the collection of vast and complicated datasets a relatively easy task. The era of big data is clearly here, but without an equal advance in the science of understanding the data. With limited time and human power, the ability to effectively harness the power of big data is a problem encountered by many companies and organizations. Enormous amounts of data take up too much storage and computing resources. That, however, does not necessarily mean an increase in valuable information or better actionable items. Only small amounts of data may address a given question, due to noise, redundancy, and non-relevance.

To increase effectiveness in data storage and handling, companies and organizations are focusing on robust data analysis techniques, such as Robust Principal Component Analysis [1] and K-medoids [2, 3], to extract insights from data and downsize storage by keeping only the relevant information. However, the general theme of "garbage in, garbage out" still applies, and with big data the problem becomes even more pronounced. For example, no meaningful patterns will be recognized if the data itself does not contain much information, or, even worse, phantom patterns will appear if the data is not relevant.

Metrics that can evaluate the sufficiency, effectiveness, and value of the collected data would bring incalculable benefits in selecting valuable datasets to analyze. Not only would the unnecessary cost of collecting redundant data be reduced, but the efficiency of obtaining insightful results would also be improved. The Data Readiness Level (DRL) measurement was first proposed by the Laboratory for Analytic Sciences at NC State University as a means to quantify such value. As the name suggests, the goal of DRL is to measure the relative readiness/richness of data to answer specific questions by various techniques.
DRL should be a generic measure applicable to a variety of data modalities, with an inference/decision/answer to a question as a shared goal. Specifically, DRL should be a function of at least four properties of data: Noisiness, Believability, Relevance, and Coherence. In this paper, we focus on datasets comprised primarily of documents, since these are among the most common datasets available and the literature in the field is rich enough to exploit various techniques. We assume that a set of documents is given to help answer a specific question. That dataset may contain relevant and irrelevant information, and it does not have to be structured. The goal is to help guide a data scientist, with limited access to the entirety of the data corpus or, alternatively, limited time to analyze it, to quantitatively assess the capacity of a subset of that corpus to answer the question. This will be achieved by developing a computational measure that reflects the readiness of the available data to answer such a question.

This is, to the best of our knowledge, the first time this notion of "goodness of unstructured data" has been addressed in the context of data analytics for seeking answers to specific questions. We discuss here an unsupervised approach to computing DRL metrics on a wide variety of data, and we demonstrate its successful application to a collection of tweets. For computational efficiency and tractability, we adopt the "Bag-of-Words" (BOW) assumption in the case study on tweets. In the BOW model, a text is represented as the collection of its words, disregarding grammar and word order but keeping multiplicity [4]. To that end, topic modeling [5], often used in text mining and natural language processing as a dimension reduction, is the adopted strategy in our work. The Latent Dirichlet Allocation approach [6] was selected, but any other approach (e.g., Latent Semantic Indexing [7] or Non-negative Matrix Factorization [8]) could have been used just as well. Cosine Similarity and Document Disparity are computed to measure the Relevance and Coherence properties of a specific text collection. The underlying assumption is that a relevant set of documents should have high similarity to the question and low internal disparity, indicating high Relevance and high Coherence respectively. Our formulation is fully data-driven, hence more "natural" and better suited to the underlying structure of the data, and the proposed metrics are relatively easy to compute.
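To make the BOW representation just described concrete, the following minimal Python sketch (the tiny vocabulary and tweet are hypothetical, not drawn from the paper's corpus) turns a text into a word-count vector, discarding grammar and word order but keeping multiplicity:

```python
from collections import Counter

def bow_vector(text, vocabulary):
    """Represent a text as word counts over a fixed vocabulary (Bag-of-Words).
    Grammar and word order are discarded; multiplicity is kept."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical example: a tiny vocabulary and one tweet.
vocab = ["usa", "basketball", "team", "win", "world", "cup", "spain"]
tweet = "USA USA basketball team will win the World Cup"
print(bow_vector(tweet, vocab))  # [2, 1, 1, 1, 1, 1, 0]
```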
The rest of this paper is organized as follows. In section 2, we discuss various required properties for a sound DRL formulation and several relevant factors in its definition. In section 3 we provide an overview of the relevant mathematical, statistical, and machine learning tools we used. In sections 4 and 5, we describe the theoretical framework of our proposed method and the methodology for implementing it. For validation as well as illustrative purposes, we present a practical example on tweets in section 6. Related work is described in section 7, followed by conclusions in section 8.

2 Definition of DRL

DRL is a function on a data set which has been subjected to a sequence of transformations towards answering a query. In that sense, it reflects a degree of maturity towards accomplishing a target task. As a metric, it may be evaluated at various stages of the transformation of the data, hence reflecting the efficacy of each of the various stages/analytics, or in one shot for the entire flow, with the goal of refining and distilling the data to more successfully resolve the query.

As a valid and useful measure, DRL has to satisfy the following properties:

1. Easy computability. This ensures that DRL remains more advantageous than actually carrying out the information analysis.
2. Stability and continuity. This implies that any two close formulations (data, question, analytics) will yield similar DRL values.
3. Scalability. This safeguards its viability with increasing data size.
4. A maximally unsupervised property, but with the allowance of fine-tuning the parameters.
5. A discriminative property, to make it useful for a data scientist in reaching a decision about the data at hand.

These properties are intrinsic to our proposed techniques and, up to minor adjustments, can be shown to be stable under perturbations/changes of the input data, in our case documents. The mathematical development of our DRL will further unveil that the following have to be accounted for:

1. Data. Given the application scope of DRL, diverse data modalities may be addressed, and any proposed metric or framework must accommodate a variety of data. The data may be noisy and unstructured.
2. Analytics. Given the great diversity of analytics/transformations and their specificity to problems, DRL may operate along a number of possible dimensions, including time, cost, and computational complexity.
3. Objective Question (Question of Interest). The nature of the proposed question may vary widely. Thus DRL must be flexible enough to accommodate binary, quantitative, and/or probabilistic questions and answers. Additionally, questions may vary in substance and detail, and hence emanate from a population with a certain distribution, so an associated DRL will accordingly be, in some sense, a "conditional" quantity relative to the query.

To further clarify the problem statement, we focus on the application of DRL to unstructured text data. To that end, we assume that a corpus of documents is given to help answer a specific question. The goal of DRL is to quantitatively describe the amount of valuable information contained in the set of documents related to the question. We consider four dimensions of the set of documents in computing DRL:

1. Relevance: the extent to which data is applicable and helpful for the question at hand.
2. Coherence: the extent to which data is consistent in subjects or topics.
3. Believability: the extent to which data is regarded as true and credible.
4. Noisiness: the extent to which data is clean and correct.

The dimensions can be modified or discarded depending on the application. For example, for time-sensitive questions, a timeliness dimension should also be considered. In this paper, we account for only these four dimensions of data to keep DRL as generic as possible.

Mathematically, let 𝒟 be the space of documents and let D be a finite set of documents of interest with cardinality N, i.e. D = {d_1, ..., d_N} ⊂ 𝒟, where d_i corresponds to the i-th document. Similarly, let 𝒬 be the space of questions and let q ∈ 𝒬 be a specific question. Our goal is to compute DRL(D|q). The connection between the space of documents and that of questions is obvious if we treat questions as micro-documents; in this sense q ∈ 𝒬 ⊂ 𝒟.

Ideally, DRL should be a function that maps any set of documents to some fixed numerical range, for example [0,1], with 0 indicating no value at all and 1 indicating the most valuable data: f : 2^𝒟 → [0,1]. An application of this would be to focus on the documents with the maximum possible DRL. Given sets of documents {D_1, D_2, ..., D_M}, it is reasonable to only require the ordering property DRL(D_i|q) > DRL(D_j|q), i.e. that D_i is more valuable than D_j for answering the question q.
3 Background

To cast our proposed DRL methodology in an applied setting, and to keep this report self-contained for a document-based experimental evaluation, we provide some brief background on text mining, making connections with the relevant research in the area.

3.1 Latent Dirichlet Allocation

Since our DRL experimental validation involves document analysis, it is natural to approach documents as bags of words, as often done in classic data mining. The size of a given corpus of documents may, however, be prohibitively large, making this approach computationally expensive and negatively impacting the efficiency of the DRL computation. To alleviate that problem we have adopted tools from topic modeling, specifically the concept of Latent Dirichlet Allocation [6]. We assume that a question and a document addressing the same topics are likely to be relevant to each other, and use that as the baseline for our relevance measure.

The essence of LDA is that documents are distributions over topics, the latter being themselves probability distributions over words. Suppose that D = {d_1, ..., d_N} is a set of N documents. Let K be the number of topics and V the number of words in the vocabulary. Let ϕ_k ∈ R^V, k = 1, ..., K, be the distribution of words corresponding to topic k. Given a document d_i ∈ D, let N_i be its number of words and θ_i ∈ R^K its distribution over topics. The generative probabilistic model defined by LDA is as follows:

1. Choose θ_i ∼ Dir(α), i ∈ {1, ..., N}, where Dir(α) is the Dirichlet distribution with parameter α = (α_1, ..., α_K).
2. Choose ϕ_k ∼ Dir(β), k ∈ {1, ..., K}, where Dir(β) is the Dirichlet distribution with parameter β.
3. For each word w_{i,j} in the j-th position of the i-th document, where i ∈ {1, ..., N} and j ∈ {1, ..., N_i}:
   (a) choose a topic z_{i,j} ∼ Multinomial(θ_i);
   (b) choose a word w_{i,j} ∼ Multinomial(ϕ_{z_{i,j}}).
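Since the paper's experiments use the Gensim Python library [14] for LDA, a minimal sketch of training such a model and projecting a document, or a question treated as a micro-document, onto its topic distribution θ might look as follows. The toy corpus and the small topic count are illustrative assumptions, not the paper's actual data:

```python
from gensim import corpora, models

# Toy pre-tokenized corpus standing in for the tweet collection (illustrative only).
texts = [["usa", "basketball", "team", "win"],
         ["world", "cup", "spain", "basketball"],
         ["team", "usa", "world", "cup", "win"]]

dictionary = corpora.Dictionary(texts)               # the vocabulary (word <-> id map)
bow_corpus = [dictionary.doc2bow(t) for t in texts]  # sparse BOW vectors in "word space"

K = 4  # number of topics; the paper's experiments use K = 50
lda = models.LdaModel(bow_corpus, num_topics=K, id2word=dictionary)

def to_semantic_space(tokens):
    """g_D: map a bag of words to its dense topic distribution theta in R^K."""
    bow = dictionary.doc2bow(tokens)
    theta = [0.0] * K
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        theta[topic_id] = prob
    return theta

# A question is treated as a micro-document and projected through the same g_D.
theta_q = to_semantic_space(["usa", "basketball", "team", "win"])
```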
3.2 The Semantic Space of Queries

To better contextualize DRL conditioned on a question q, we seek a class of questions Q conveying a similar "message" with possibly different words. This may be thought of as defining a semantic equivalence class for q. Let again 𝒟 be the set of all documents, each considered as a bag of words over a high-dimensional lexicon. Each question, regarded as a micro-document, then corresponds to a representation in the Semantic Space, denoted by S. This space may roughly be associated with the space of distributions over topics, which, recall, result from LDA turning documents (bags of words) into distributions over topics. What we should be basing our DRL on, then, is the set of questions that have the same image under some function g : 𝒟 → S. The idea of this semantic space can easily be extended to other modalities, so long as a good proximity measure is identified. Unfortunately, such a measure is still not well understood, much like the notion of this space itself.

To proceed with the mathematical development, denote by D a family of documents which is a subset of 𝒟. The LDA approach thus defines a function

(3.1)   g_D : D → S,

associating with each document its semantic meaning. A question q is typically also projected onto this Semantic Space through the same function g_D. By construction this function depends on the document set itself, hence the index D. It would be ideal if a general function could be found that attributes meaning to documents without reference to the specific setting; however, g_D is good enough for our purpose.

As shown in figure 1 [Figure 1: Space mapping], the documents d_i ∈ D ⊂ 𝒟, i = 1, 2, 3, are projected onto the semantic space, and q_1, q_2 are two queries in Q ⊂ 𝒟 with the same representation in S. The "distance" between the query and the documents should be computed in terms of distances in the semantic space. In so doing, the DRL will depend on the "meaning" of the question posed, and avoid being indicative of the specific words used to formulate it.

We note that a philosophical analogy may be drawn between this formulation of DRL and that of Information Retrieval (IR) as related to query expansion, which invokes techniques for finding synonyms, stemming, spelling correction, etc. We maintain that DRL should, however, go further and deeper by breaking down the concept, for a better understanding of the question and for discovering the intent behind it.

3.3 Jensen-Renyi Divergence

Information theory and statistics are rich with measures of closeness (conversely, divergence) between probability density functions (PDFs). A PDF, in turn, reflects the intrinsic behavior of a given population. It then makes eminent sense to seek the same notion of concentration of data in a given population, across populations. That is precisely what is reflected in information-theoretic measures such as divergence (for example, the Kullback-Leibler (KL) divergence). To that end, we build on some of our previous work on developing measures across an arbitrary number of PDFs, as a step beyond the well-known classical KL divergence between two PDFs. Viewing all documents as distributions over topics via LDA, we first define the so-called Jensen-Renyi (JR) divergence [9].

Definition 1. Let p_1, p_2, ..., p_n be n probability distributions on a finite set {x_1, x_2, ..., x_k}, k ∈ N; each distribution is p_i = (p_{i,1}, ..., p_{i,k}) with Σ_{j=1}^k p_{i,j} = 1 and p_{i,j} = P_i(x_j) ≥ 0. Let ω = (ω_1, ω_2, ..., ω_n) be a weight vector with Σ_{i=1}^n ω_i = 1, ω_i ≥ 0. The JR divergence is defined as

(3.2)   JR_α^ω(p_1, ..., p_n) = R_α( Σ_{i=1}^n ω_i p_i ) − Σ_{i=1}^n ω_i R_α(p_i),

where R_α(p) is the Renyi entropy,

R_α(p) = (1/(1−α)) log Σ_{j=1}^k p_j^α,   α > 0, α ≠ 1.

The JR divergence is a convex function of p_1, p_2, ..., p_n for α ∈ (0,1). It achieves its minimum value of zero when p_1, p_2, ..., p_n are all equal, for all α > 0, and attains its maximum value when p_{i,j} = δ_{i,j}, where δ_{i,j} = 1 if i = j and 0 otherwise. Many more properties of the JR divergence can be found in [9].
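A direct NumPy transcription of Definition 1 and Eq. (3.2), offered as a sketch rather than the authors' implementation, would be:

```python
import numpy as np

def renyi_entropy(p, alpha):
    """R_alpha(p) = 1/(1 - alpha) * log(sum_j p_j^alpha), alpha > 0, alpha != 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # zero entries contribute nothing to the sum for alpha > 0
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def jensen_renyi(distributions, weights=None, alpha=0.5):
    """JR divergence of Eq. (3.2): entropy of the mixture minus mixture of entropies."""
    P = np.asarray(distributions, dtype=float)   # shape (n, k): n distributions over k bins
    n = P.shape[0]
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    mixture = w @ P                              # sum_i w_i * p_i
    return renyi_entropy(mixture, alpha) - sum(w[i] * renyi_entropy(P[i], alpha)
                                               for i in range(n))

# Identical distributions give zero divergence; disjoint ones give the maximum.
print(jensen_renyi([[0.5, 0.5], [0.5, 0.5]]))   # ~0.0
print(jensen_renyi([[1.0, 0.0], [0.0, 1.0]]))   # > 0
```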
3.4 Sensitivity and Closeness

In numerical linear algebra and linear system theory, a non-singular matrix A is ill-conditioned if a relatively small change in A can cause a large relative change in A^{-1}: ||x − x'|| / ||x|| ≤ k · ||B|| / ||A||, where B is the perturbation input to the system A and k is the condition number k = ||A|| · ||A^{-1}|| [10]. This concept can be extended to non-linear systems, like LDA, using ideas from functional analysis on metric spaces, as follows:

Definition 2. Given a function f : X → Y, where (X, d_X) and (Y, d_Y) are metric spaces, we say that f is r-locally l-Lipschitz if and only if for all x ∈ X and all y ∈ X such that d_X(x, y) ≤ r we have

d_Y(f(x), f(y)) ≤ l · d_X(x, y).

The Lipschitz constant l is connected to the derivative of a function at a point (recall that a differential invokes d_X(x, y) as a denominator). If l is smaller than 1 we say that f is a contraction; in other words, it makes small errors even smaller (it stabilizes). If the two spaces have wildly different metrics, however, the Lipschitz constant is not very informative. To alleviate this problem we normalize the two metrics and define a new sensitivity number as follows:

Definition 3 (Sensitivity Number). Given a function f : X → Y, where (X, ||·||_X) and (Y, ||·||_Y) are normed spaces, we define the relative r-sensitivity at the point x_0 to be

(3.3)   s_1(x_0) = inf { ( ||f(x) − f(x_0)||_Y / ||f(x_0)||_Y ) ÷ ( ||x − x_0||_X / ||x_0||_X ) : x ∈ X, ||x − x_0||_X ≤ r }.

We will use the notions in Eq. 3.3 to report a form of stability to small perturbations for our measures.

4 Theoretical Framework

In this section we define the two preliminary measures we propose for a functional DRL, corresponding to the relevance of a set of documents to a certain question and to its overall coherence.

Definition 4 (Relevance R). Let M denote the number of sets of documents and N_i the number of documents in the i-th collection. Each document d_j^{(i)} is represented as g(d_j^{(i)}) = [θ_{j1}^{(i)}, θ_{j2}^{(i)}, ..., θ_{jK}^{(i)}], i = 1, 2, ..., M, j = 1, 2, ..., N_i, where θ_{jk}^{(i)} is the proportion of topic k in the j-th document of the i-th collection, using eq. (3.1). The question is likewise represented by g(q) = [θ_1, θ_2, ..., θ_K]. Define the Relevance metric to be the average cosine similarity between each document in the collection D_i and the question q:

(4.4)   Sim(D_i, q) = (1/N_i) Σ_{j=1}^{N_i} [ g(d_j^{(i)}) · g(q) ] / [ ||g(d_j^{(i)})|| ||g(q)|| ].

Formally, let X be a collection of sets of documents and Q a set of questions. The function Sim : X × Q → R^+ is a DRL measure denoting the relation between each set and a question. Note that it may just as well be applied to individual documents (when the document set has size 1) or to the whole corpus of documents. Eq. (4.4) obviously induces a relation between two document sets, as follows:

Definition 5. Let D_1 and D_2 be two sets of documents and q an associated question as described above. We say that D_1 and D_2 are informationally equivalent relative to the question q if and only if

Sim(D_1, q) = Sim(D_2, q).

It is hard to find informationally equivalent document sets in practice, hence warranting a definition of δ-equivalent document sets:

Definition 6. Let D_1 and D_2 be two sets of documents and q an associated question as described above. We say that D_1 and D_2 are δ-informationally equivalent relative to the question q if and only if

|Sim(D_1, q) − Sim(D_2, q)| ≤ δ.

If we consider the set X under the informational equivalence relation, or the δ-informational equivalence, then the function Sim induces a total order relative to the question q. Hence all sets in X can be compared, and the one most relevant to the question can be identified.
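Eq. (4.4) translates directly into code. The sketch below assumes topic-distribution vectors such as those produced by the hypothetical to_semantic_space helper above:

```python
import numpy as np

def relevance(doc_thetas, q_theta):
    """Eq. (4.4): Sim(D_i, q), the mean cosine similarity between each
    document's topic distribution g(d_j) and the question's g(q)."""
    q = np.asarray(q_theta, dtype=float)
    sims = [np.dot(d, q) / (np.linalg.norm(d) * np.linalg.norm(q))
            for d in map(np.asarray, doc_thetas)]
    return float(np.mean(sims))
```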
While cosine similarity measures which document set is more related to the question, it does not provide any information on how focused the document set is on the topic(s) of interest. It is possible that the document set with the higher similarity to the question contains much irrelevant information. To capture that property we propose Document Disparity as an indicator of Coherence:

Definition 7 (Coherence C). Let D_i be a collection of documents written as distributions over topics. Define the Document Disparity to be the JR divergence of this set of documents; the reciprocal of the document disparity is then the corresponding Coherence:

(4.5)   C(D_i) = 1 / DD(D_i) = 1 / JR_α^ω( g(d_1^{(i)}), ..., g(d_{N_i}^{(i)}) ).

This function C : X → R^+ is independent of the question q and measures how "different" the documents are within their set. A lower disparity measure within a set of documents is equivalent to saying that they are focused on the same category of topics, namely, that they are more coherent.

In our experiments below we will show that the metrics defined above are contractions and have a low sensitivity number as defined in section 3.4. In general, any relevant DRL should enjoy these properties.

5 Methodology

Given a set of document collections X = {D_1, D_2, ..., D_M}, figure 2 [Figure 2: Methodology flow chart] shows our framework for computing the aforementioned DRL metrics. Our method comprises two parts: the first involves a model training process, while the second evaluates the metrics for relevance and coherence using the previous definitions.

In model training, a training corpus D ⊂ X is preprocessed to build an LDA model. Let V be the number of words in the vocabulary. Under the BOW assumption, each document d ∈ D is a V-dimensional vector d ∈ R^V, a point in the so-called "word space". The vector is sparse because only a few words of the vocabulary appear in any one document. The "LDA Model Training" module takes as inputs the vector representation of the training corpus D and the number of topics K, and generates an LDA model g_D : X → S that maps a single document d from the V-dimensional "word space" to the K-dimensional "semantic space". Note that the training corpus D does not need to be the set of document collections X. Besides LDA, other representation-learning approaches, such as doc2vec [11] and other topic-modeling methods [5], could be applied to approximate the mapping to the semantic space. Since we are more interested in analyzing the behavior of DRL, we leave the analysis of the best document vector representation for future work.

The second part of the process is the computation of the defined metrics. Given a set of document collections X and a specific question q, we project each document d ∈ D_i, D_i ∈ X, i = 1, 2, ..., M, and the question q onto the semantic space S using the learned function g_D. Both the cosine similarity Sim(D_i, q) and the coherence C(D_i) are then computed in the semantic space for all D_i ∈ X, with Eq. 4.4 and Eq. 4.5 respectively.
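Putting the pieces together, a sketch of the two-part methodology, reusing the hypothetical helpers to_semantic_space, relevance, and jensen_renyi from the earlier sketches, could read:

```python
def coherence(doc_thetas, alpha=0.5):
    """Eq. (4.5): Coherence is the reciprocal of the Document Disparity,
    the JR divergence of the documents' topic distributions."""
    return 1.0 / jensen_renyi(doc_thetas, alpha=alpha)

def drl_metrics(collections, question_tokens):
    """For each collection D_i in X, project its documents and the question
    onto the semantic space and report (Sim(D_i, q), C(D_i))."""
    q_theta = to_semantic_space(question_tokens)
    results = []
    for docs in collections:  # docs: a list of tokenized documents
        thetas = [to_semantic_space(d) for d in docs]
        results.append((relevance(thetas, q_theta), coherence(thetas)))
    return results
```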
6 Experiment and Results

6.1 Dataset Collection

To validate our proposed DRL measures, we carried out the following experiments using Twitter data that we collected. Specifically, the dataset consists of 463,790 tweets collected from Twitter's Streaming APIs (https://stream.twitter.com/1.1/statuses/filter.json) during the FIBA Basketball World Cup 2014. The collected tweets cover a time span from August 30, 2014 to September 18, 2014.

To circumvent structural difficulties associated with Twitter data when converting text into vectors, we performed the following preprocessing steps: stopwords such as "the", "of", and "and" were deleted; HTTP links were deleted, leaving only digits and Latin characters; words of document frequency smaller than 5, and all one-word tweets, were deleted; text was lowercased.

The cumulative number of 463,204 tweets yielded a vocabulary size of 12,207. The corpus of tweets was divided into daily sets, resulting in a series X = {D_1, D_2, ..., D_20}. The primary objective of the experiment was the discovery of the set most relevant to the following question: "Will the USA Basketball team win the world cup in Spain?" We partitioned the tweets to match the FIBA Basketball World Cup daily schedule. The assumption underlying our experiment is that there is an uptick in messages about the day's game, so that the tweet messages in the days approaching the final should strengthen the relevance of the set to the initial question of interest, namely the ultimate "win of the cup by the USA". All tweets about unimportant (or least relevant) games on a given day are effectively noise. Conversely, all Team USA games around key dates (preliminary rounds, knockout games, etc.) lead to a large number of daily tweets, which in turn reflect the fans' opinions and prognoses about the question at hand.

6.2 Experiment Design

Each tweet was regarded as a document; note that for simplicity, no tweet-pooling schemes [12, 13] (aggregating tweets into one document) were used, albeit another viable alternative to consider. All the tweets were used as the training corpus for the LDA model. After some experimentation, we converged on 50 as the number of topics, as it gave a set of clear patterns; it is worth noting that this choice did not greatly impact the final results, but presented some computational advantage. This number is also a key parameter for the LDA algorithm, which for the most part represents the heaviest computational load in the experiment. For the actual implementation we used the Gensim Python library [14] and ran a standard LDA on the corpus.

For the metrics computation, we first used LDA to project the collections of daily tweets, as well as the question, onto the semantic space. We then computed both the cosine similarity between the tweets and the question (as distributions over topics) and the Coherence of each set. To account for the non-deterministic nature of LDA, we performed n = 200 random computations and retained the averages of the cosine similarity and the JR divergence.
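To mirror this averaging over n = 200 random LDA runs, one could refit the model with different seeds and average the two quantities. In the sketch below, train is a hypothetical stand-in for re-running the training step of section 5 with a fresh seed (Gensim's LdaModel does accept a random_state argument) and returning a projection function like to_semantic_space:

```python
import numpy as np

def averaged_daily_metrics(daily_sets, question_tokens, train, n_runs=200):
    """Average Sim(D_i, q) and DD(D_i) over repeated LDA trainings to smooth
    out LDA's non-determinism. `train(seed)` is assumed to refit the LDA model
    and return a projection function g_D (cf. to_semantic_space above)."""
    sims = np.zeros(len(daily_sets))
    disparities = np.zeros(len(daily_sets))
    for seed in range(n_runs):
        g = train(seed)                  # refit LDA with a fresh random seed
        q_theta = g(question_tokens)
        for i, docs in enumerate(daily_sets):
            thetas = [g(d) for d in docs]
            sims[i] += relevance(thetas, q_theta)
            disparities[i] += jensen_renyi(thetas)
    return sims / n_runs, disparities / n_runs
```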
6.3 Experiment Results

The results of our experiment with the FIBA Basketball World Cup tweet data are shown in figures 4 and 5. We first note the emergence of a time lag of at most one day in our results, which is likely due to the difference between game occurrence time and tweet reaction time.

The question, "Will the USA Basketball team win the world cup in Spain?", was represented as a distribution over topics, as shown in figure 3 [Figure 3: Topic proportion for the query]. The corresponding topics (top 5 words) were:

Topic 4: 0.272*win + 0.119*rose + 0.102*derrick + 0.085*point + 0.083*performance;
Topic 19: 0.343*world + 0.318*cup + 0.301*basketball + 0.008*group + 0.006*liked;
Topic 31: 0.203*spain + 0.156*world + 0.066*worl + 0.060*celebration + 0.056*cup;
Topic 34: 0.287*usa + 0.228*team + 0.211*world + 0.192*cup + 0.011*home.

Figure 4 [Figure 4: Daily Cosine Similarity] shows the average cosine similarity between the daily sets of tweets and the question q, as well as the variance over the 200 iterations. We quickly note that the dates 9.2-9.3, 9.11-9.12, and 9.14-9.15 show relatively high cosine similarity, with that of 9.14-9.15 being the highest. According to the FIBA basketball calendar, the USA qualified for the second round on 9.3 (mathematically already on 9.2), explaining the high relevance of those days' tweets to Team USA. We also note that on the 11th of September the USA beat Lithuania in the semi-final and qualified for the final. The day of the final, 9.14, is clearly the day when the relation of the tweets is as close as possible to our question, producing the highest output on that day.

While one would expect a smaller spike in our graph on 9.9-9.10, when the USA beat Slovenia in the quarter-finals, a concurrent event (the game of Lithuania versus Turkey) spread the expected focus of the day away from Team USA, owing to additional tweets about that other game. This amounts to declaring the dataset at that point "less ready" to answer the question of interest, since more than one topic was discussed that day. This is more easily seen in the Document Disparity plot, figure 5 [Figure 5: Daily Document Disparity], where the DD for 9.9 is one of the highest recorded; the coherence, being its reciprocal, is correspondingly the lowest.

Looking now at figure 5, we see that the lowest document disparity occurs on 9.14-9.15. Again this is easy to explain, since the final is an event of central importance and would hence be central to the tweets. We hasten to point out that during the first few days (8.30-9.2), a very high DD was due to the period of preliminary rounds, when noise topics were emerging (at least one for each participating team). This trend carried on to 9.9-9.10, at the conclusion of the quarter-finals, after which the discussed topics became more concentrated on the remaining teams and their upcoming final games. We believe the small dip in disparity appearing on 9.2-9.3 is related to the peak of the cosine similarity for those same days, most likely on account of the fact that Team USA appeared set to qualify for the finals, and possibly to repeat its victory. Various tweets of those days make predictions about the winners of the final and mention that the USA is going to win again.

In light of the above results, we believe that both measures present a good foundation for DRL by inferring relations of a dataset to a question of interest. As a result, a data scientist may opt to bypass other sets of documents and focus on the ones with high cosine similarity and small document disparity. Conversely, one may argue that for a certain class of questions one should instead select high cosine similarity and high document disparity, to reflect the highly exploratory, rather than fact-driven, nature of the posed question. The proposed method provides, in both cases, a step towards an automated, semi-supervised method for measuring the readiness of a given data set. It remains that a true quantitative DRL is expected to be some function of the two metrics discussed above.

6.4 Perturbation Test

Our goal is to show that any small perturbation of the query q only minimally impacts the LDA model (same projection function), hence providing a relatively consistent semantic representation, i.e., a small r-sensitivity number s_1(q). Given the clear difficulty of computing the infimum over all possible perturbations of the query, we provide the sensitivity number for various cases of perturbation and define a corresponding sensitivity quotient:

s_1(q, q_p) = (relative change in semantic representation) / (relative change in query)
            = ( ||g_D(q) − g_D(q_p)|| · ||q|| ) / ( ||q − q_p|| · ||g_D(q)|| ).

In section 4 we defined the first DRL measure of a set of documents D with respect to a question q to be Sim(D, q). The goal is then to see how sensitive this DRL measure is to a change in the query. Testing this sensitivity for a set of documents with high DRL (tweets from Sept. 15) and another with low DRL (tweets from Sept. 7) yields, much like s_1(q, q_p), the normalized sensitivity quotient

s_2(X, q, q_p) = (relative change in DRL) / (relative change in query)
              = ( |Sim(X, q) − Sim(X, q_p)| · ||q|| ) / ( ||q − q_p|| · |Sim(X, q)| ).
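Both quotients reduce to a few lines of NumPy. In the sketch below, q_vec and qp_vec denote fixed-dictionary BOW vectors of the query and its perturbation, and g_q, g_qp their semantic (topic-distribution) representations; all of these names are hypothetical:

```python
import numpy as np

def s1(g_q, g_qp, q_vec, qp_vec):
    """s_1(q, q_p): relative change in the semantic representation divided
    by the relative change in the (BOW) query."""
    num = np.linalg.norm(np.subtract(g_q, g_qp)) * np.linalg.norm(q_vec)
    den = np.linalg.norm(np.subtract(q_vec, qp_vec)) * np.linalg.norm(g_q)
    return num / den

def s2(sim_q, sim_qp, q_vec, qp_vec):
    """s_2(X, q, q_p): relative change in the DRL measure Sim(X, .) divided
    by the relative change in the query."""
    num = abs(sim_q - sim_qp) * np.linalg.norm(q_vec)
    den = np.linalg.norm(np.subtract(q_vec, qp_vec)) * abs(sim_q)
    return num / den
```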
We constructed various perturbations of the query based on word repetition, synonym replacement, and word deletion. Having fixed a dictionary, the bag-of-words representation of the query is similarly independent of the set of documents X, all the while using the LDA model. Note, though, that the semantic representation through the LDA depends heavily on X.

Table 1: Perturbed query examples

Repetition:
q_a1: USA USA basketball team win world cup Spain
q_a2: USA basketball basketball team win world cup Spain
q_a3: USA basketball team team win world cup Spain
q_a4: USA basketball team win win world cup Spain
q_a5: USA basketball team win world world cup Spain
q_a6: USA basketball team win world cup cup Spain
q_a7: USA basketball team win world cup Spain Spain

Replacement:
q_b1: USA basketball team win FIBA Spain
q_b2: USA basketball team win world cup FIBA
q_b3: USA basketball team win world cup Spain2014

Deletion:
q_c1: USA team win world cup Spain
q_c2: USA basketball win world cup Spain
q_c3: USA basketball team win world cup
q_c4: USA basketball team win Spain

The list of perturbed queries is shown in table 1. The topic distributions of some of these perturbed queries are shown in figure 6 [Figure 6: Topic distributions of perturbed queries; panels (a) q_a1, (b) q_a2, (c) q_a3, (d) q_b1, (e) q_c1, (f) q_c2]. The sensitivity numbers are shown in figure 7 [Figure 7: Sensitivity numbers]: entries 1-7 correspond to the queries q_a1-q_a7, entries 8-10 to the queries q_b1-q_b3, and entries 11-14 to the queries q_c1-q_c4.

In almost all cases the sensitivity quotient s_1, corresponding to "LDA" in the figure, is smaller than 1; only for the cases q_c1 and q_c2 is it bigger than one. This is because topics 19 (about the World Cup basketball) and 34 (about the USA basketball team) were ill-proportioned in the perturbed query due to loss of information. One can argue that this change (the deletion of one word) is small in terms of distance in the word space W, but relatively large in the semantic space S. The sensitivity quotient s_2 is also smaller than 1.

7 Related Work

The idea of Data Readiness Levels is closely related to the field of information retrieval. The main goal of traditional information retrieval is to find documents relevant to a natural-language query [15]; the desired output is for the most relevant documents to be top-ranked in the returned list. In our research, though, we would like to account for other data modalities such as videos and images. The general definition of DRL would evaluate the value of a piece of data towards an objective, while information retrieval measures only the relevance of a piece of information to a query. It is true that information retrieved for a query should have a higher DRL than a randomly selected piece of data with respect to that same query. In this sense, information-retrieval techniques, especially question answering, could be applied to increase the DRL of data in the relevance dimension, and probably also in the coherence dimension. In cases where text is the main data medium, DRL differs from information retrieval in that DRL focuses on the "goodness" of a collection of documents for an objective, rather than of a single document.
The Technology Readiness Level (TRL) can be thought of as a template for DRL. TRL is a well-established methodology for estimating technology maturity during the acquisition process, to assist decision-making concerning technology funding and transition [16]. The U.S. Department of Defense (DoD) has defined TRL on a scale from 1 to 9, with 9 being the most mature technology. Unlike TRL, there are no standard evaluation rules for data maturity, but we argue that it is not possible to define a general DRL without a specific objective. For example, a collection of documents about basketball games has less value for answering a question about volleyball than one about basketball, no matter how refined those documents are.

8 Conclusions

In summary, cosine similarity measures the average relation of a document set to a given question, while document disparity reflects the variance or diversity of the information contained in the document set. Given the two measures, we would favor a dataset with a larger Sim measure and a lower DD measure: such a dataset has a high data readiness level.

By a combined theoretical-experimental approach, we have laid out some groundwork for a viable definition of a computable DRL measure, usable for any analytic process and especially for large datasets. While heavily intuited and illustrated using document-based data, DRL should be a generic and flexible measure compatible with any data modality. A great deal of work clearly remains to be undertaken, particularly in accurately establishing a quantitative scale for it, as well as in applying it to other modalities such as images, audio signals, etc.

We believe that by exploring the idea of the semantic space, a concept which is not yet well understood, the linking between various modalities and a specific question can be achieved, opening the way for comparisons and computations on readiness.
Here we are not using the standard ideas behind the semantic space as the re-formulation of documents through a topic-discovery model. What we really mean is the creation of an abstract space containing cognitive information and associations between ideas, similar to human intellect and understanding. In this semantic space, the concepts contained in documents, images, and other signals will have a common reference system, enabling us to link and compare them. Distances in this semantic space will formalize the idea of "notionally close" that humans naturally possess.

Acknowledgment

We would like to thank all the members of the Data Readiness Level Team, especially Dr. Harish Chintakunta, for meaningful discussions on the subject, and the Laboratory for Analytic Sciences for its generous financial support.

References

[1] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.
[2] Leonard Kaufman and Peter Rousseeuw. Clustering by means of medoids. North-Holland, 1987.
[3] Brendan J. Frey and Delbert Dueck. Clustering by passing messages between data points. Science, 315(5814):972-976, 2007.
[4] Zellig S. Harris. Distributional structure. In Papers in Structural and Transformational Linguistics, pages 775-794. Springer, 1970.
[5] David M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77-84, 2012.
[6] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, March 2003.
[7] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50-57. ACM, 1999.
[8] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556-562, 2001.
[9] Yun He, A. Ben Hamza, and Hamid Krim. A generalized divergence measure for robust image registration. IEEE Transactions on Signal Processing, 51:1211-1220, 2003.
[10] David A. Belsley, Edwin Kuh, and Roy E. Welsch. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, volume 571. John Wiley & Sons, 2005.
[11] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, volume 14, pages 1188-1196, 2014.
[12] Rishabh Mehrotra, Scott Sanner, Wray Buntine, and Lexing Xie. Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In The 36th Annual ACM SIGIR Conference, page 4, Dublin, Ireland, July 2013.
[13] Liangjie Hong and Brian D. Davison. Empirical study of topic modeling in Twitter. In Proceedings of the First Workshop on Social Media Analytics, SOMA '10, pages 80-88, New York, NY, USA, 2010. ACM.
[14] Radim Řehůřek and Petr Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45-50. ELRA, May 2010.
[15] ChengXiang Zhai. Statistical language models for information retrieval. Synthesis Lectures on Human Language Technologies, 1(1):1-141, 2008.
[16] US DoD. Technology readiness assessment (TRA) guidance. Revision posted, 13, 2011.
