ebook img

JSON: data model, query languages and schema specification PDF

0.37 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview JSON: data model, query languages and schema specification

JSON: data model, query languages and schema specification Pierre Bourhis Juan L. Reutter Fernando Suárez CNRSCRIStALUMR9189 PUCChileandCenterfor PUCChileandCenterfor SemanticWebResearch SemanticWebResearch Domagoj Vrgocˇ 7 PUCChileandCenterfor 1 SemanticWebResearch 0 2 n a ABSTRACT Web services communicating with their users through J anApplicationProgrammingInterface(API),asJSON Despite the fact that JSON is currently one of the most 9 is currently the predominant format for sending API popular formats for exchanging data on the Web, there are requests and responses over the HTTP protocol. Addi- very few studies on this topic and there are no agreement ] tionally,JSONformatismuchusedindatabasesystems B upon theoretical framework for dealing with JSON. There- builtaroundtheNoSQLparadigm(seee.g. [23,2,27]), D foreinthispaperweproposeaformaldatamodelforJSON or graph databases (see e.g. [33]). documents and, based on the common features present in . Despite its popularity,the coverageofthe specifics of s availablesystemsusingJSON,wedefinealightweightquery c languageallowingustonavigatethroughJSONdocuments. JSON format in the research literature is very sparse, [ and to the best of our knowledge, there is still no We also introduce a logic capturing the schema proposal agreement on the correct data model for JSON, no 1 for JSON and study the complexity of basic computational formalisation of the core query features which JSON v tasks associated with thesetwo formalisms. systems should support, nor a logical foundation for 1 2 JSON Schema specification. And while some prelim- 1. INTRODUCTION 2 inary studies do exist [24, 22, 8, 29], as far as we 2 JavaScript Object Notation (JSON) [18, 12] is a are aware, no attempt to describe a theoretical basis 0 lightweight format based on the data types of the for JSON has been made by the research community. . JavaScript programming language. In their essence, Therefore, the main objective of this paper is to for- 1 0 JSON documents are dictionaries consisting of key- mallydefineanappropriatedatamodelforJSON,iden- 7 valuepairs,wherethevaluecanagainbeaJSONdocu- tify the key querying features provided by the existing 1 ment,thusallowinganarbitrarylevelofnesting. Anex- JSON systems, and to propose a logic allowing us to : ample of a JSONdocument is givenin Figure 1. As we specify schema constraints for JSON documents. v canseehere,apartfromsimple dictionaries,JSONalso In order to define the data model, we examine the i X supports arrays and atomic types such as integers and key characteristics of JSON documents and how are r strings. Arrays and dictionaries can again contain ar- they used in practice. As a result we obtain a tree- a bitraryJSONdocuments,thus makingthe formatfully shaped structure very similar to the ordered data-tree compositional. model of XML [7], but with some key differences. The first difference is that JSON trees are deterministic by { design, as each key can appear at most once inside a "name": { dictionary. This has various implicationsat the time of "first": "John", querying JSONdocuments: onone hand we sometimes "last": "Doe" deal with languages far simpler than XML, but on the }, other hand this key restriction can make static analy- "age": 32, sis more complicated, even for simpler queries. Next, "hobbies": ["fishing","yoga"] arrays are explicitly present in JSON, which is not the } case in XML. Of course, the ordered structure of XML could be used to simulate arrays, but the defining fea- Figure 1: A simple JSON document. ture of each JSON dictionary is that it is unordered, thusdictatingthenodesofourtreetobetypedaccord- Duetoitssimplicity,andthefactthatitiseasilyread- ingly. Andfinally,JSONvaluesareagainJSONobjects, ablebothbyhumansandbymachines,JSONisquickly thus making equality comparisons much more complex becomingoneofthe mostpopularformatsforexchang- than in case of XML, since we are now comparing sub- ing data on the Web. This is particularly evident with trees, and not just atomic values. We cover all of these our JSON documents are only formed by objects, ar- features of JSON in more detail in the paper, and we rays, strings and natural numbers. also argue that, while technically possible (albeit, in a Formally, denote by Σ the set of all unicode charac- very awkwardmanner), coding JSON documents using ters. JSON values are defined as follows. First, any XML might not be the best solution in practice. natural number n ě 0 is a JSON value, called a num- Next, we consider the problemof queryingJSON.As ber. Furthermore, if s is a string in Σ˚, then "s" is there is no agreement upon query language in place, a JSON value, called a string. Next, if v1,...,vn are we examine an array of practical JSON systems, rang- JSON values and s1,...,sn are pairwise distinct string ing from programming languages such as Python [14], values,thents1 :v1,...,sn :vnuisaJSONvalue,called to fully operationalJSONdatabasessuchas MongoDB an object. In this case, each s :v is called a key-value i i [23],andisolatewhatweconsidertobekeyconceptsfor pair of this object. Finally, if v1,...,vn are JSON val- accessing JSON documents. As we will see, the main ues then rv1,...,vns is a JSON value called an array. focus inmany systemsis onnavigatingthe structureof In this case v1,...,vn are called the elements of the a JSON tree, therefore we propose a navigational logic array. Note that in the case of arrays and objects the for JSON documents based on similar approachesfrom values v can again be objects or arrays, thus allowing i the realm of XML [13], or graph databases [3, 21]. We the documents an arbitrary level of nesting. thenshowhowourlogiccapturescommonusecasesfor JSON navigation instructions. Arguably all sys- JSON, extend it with additional features, and demon- tems using JSON base the extraction of information in strate that it respects the “lightweight nature” of the what we call JSON navigation instructions. The nota- JSON format, since it can be evaluated very efficiently, tionusedtospecifyJSONnavigationinstructionsvaries and it also has reasonable complexity of main static from system to system, but it always follows the same tasks. Interestingly, sometimes we can reuse results two principles: devised for other similar languages such as XPath or ‚ If J is a JSON object, then one should be able to Propositional Dynamic Logic, but the nature of JSON access the JSON value in a specific key-value pair and the functionalities present in query languages also of this object. demand new approaches or a refinement of these tech- niques. ‚ If J is a JSON array, then one should be able to Another important aspect of working with any data access the i-th element of J. format is being able to specify the structure of docu- In this paper we adopt the python notation for navi- ments. A usual way to do this is through schema spec- gationinstructions: IfJ is anobject,thenJrkeys isthe ification, and in the case of JSON, there is indeed a value of J whose key is the string value "key". Like- draftproposalfor a schemalanguage[19](calledJSON wise, if J is an array, then Jrns, for a natural number Schema), which has recently been formalised in [29]. n, contains the n-th element of J1. Basedontheformalisationof[29]wedefinealogiccap- As far as we are aware, all JSON systems use JSON turing the full formal specification of JSON Schema, navigation instructions as a primitive for querying showthatitisessentiallyequivalenttothenavigational JSON documents, and in particular it is so for the sys- languagewe proposeforqueryingJSON,andstudy the tems we have reviewed in detail: Python and other complexity of its evaluation and static tasks. However, programming languages [14], the mongoDB database once we take into account the recursive functionalities [23], the JSON Path query language [15] and the of JSON Schema, we arrive at a powerful new formal- SQL++ project that tries to bridge relational and ismthatismuchmoredifficulttoencompassinsidewell JSON databases [25]. known frameworks. At this point it is important to note a few important Finally, since theoretical study of JSON is still in its consequencesofusingJSONnavigationinstructions,as earlystages,weclosewithaseriesofopenproblemsand these have an important role at the time of formalising directions for future research. this framework. Organisation. We formally define JSON and some of First, note that we do not have a way of obtaining itsfeaturesinSection2. Theappropriatedatamodelfor the keys of JSON objects. For example, if J is the JSON is discussed in Section 3, and its query language object{"first":"John", "last":"Doe"}wecouldis- in Section 4. In Section 5 we define a logic capturing sue the instructions Jrfirsts to obtain the value of a schema specification for JSON. Our conclusions and the pair "first":"John", which is the string value the directions for future work are discussed in Section "John". Or use Jrlasts to obtain the value of the pair 6. Due to the lack of space most proofs are placed in "last":"Doe". However, there is no instruction that the appendix to this paper. canretrievethekeysinsidethisdocument(i.e. "first" and "last" in this case). 2. PRELIMINARIES Similarly, for the case of the arrays, the access is es- JSON documents. We start by fixing some nota- sentially a random access: we can access the i-th ele- tion regarding JSON documents. The full JSON spec- ment of an array,and most of the time there are prim- ification defines seven types of values: objects, arrays, itives to access the first or the last element of arrays. strings, numbers and the values true, false and null [9]. 1Some JSON systems prefer using a dot notation, where However, to abstract from encoding details we assume J.key and J.n are theequivalentsof Jrkeys and Jrns. However,we cannot reasonabout differentelements of The rootoftree representsthe entire document. The the array. For example, for the array K “ [12,5,22] twoedgeslabelled"name"and"age"representtwokeys we can not retrieve, say, an element (or any element) inside this JSON object, and they lead to nodes repre- which is greater than the first element of K. Some sys- senting their respective values. In the case of the key tems do featureFLWR expressionsthatsupportiterat- "age"thisisjustaninteger,whileinthecaseof "name" ing overallelements. But this iterationis itself treated we obtain another JSON object that is represented as asaseriesofrandomaccesses,usingcommandssuchas a subtree of the entire tree. For i in (0,n) print(J[i]). Finally,weneedtoenforcethepropertyofJSONthat no object can have two keys with the same name, thus 3. DATAMODELFORJSON making the model deterministic in some sense, since eachnodewillhaveonlyonechildreachablebyanedge In this section we propose a formal data model for withaspecificlabel. Letusbrieflysummarisetheprop- JSON documents whose goal is to closely reflect the erties of our model so far. mannerinwhichJSONismanipulatedusingJSONnav- igationinstructions,andthatwillbeusedlateronasthe Labelled edges. Edges in our model are labelled by the basis of our formalisation of JSON query and schema keysformingthekey-valuepairsofobjects. Thismeans languages. We begin by introducing our formal model, thatwecandirectlyfollowthelabelofedgeswhenissu- calledJSONtrees. Afterwardswe discuss the maindif- ing JSON navigationinstructions, and also means that ferencesbetweenJSONtreesandotherwell-studiedtree informationofkeysisrepresentedinadifferentmedium formalisms such as data trees or XML. than JSON values (labels for the former, nodes for the latter). This is inline with the way JSON navigation 3.1 JSON trees instructions work, as one can only retrieve values of JSON objects are by definition compositional: each key-value pairs, but not the keys themselves. To com- JSONobjectis a setofkey-valuepairs,inwhich values ply with the JSON standard,we disallow trees where a can again be JSON objects. This naturally suggests same edge label is repeated in two different edges leav- using a tree-shaped structure to model JSON docu- ing a node. ments. However,this structure must preserve the com- Compositional structure. One of the advantages of our positional nature of JSON. That is, if each node of the tree representation is that any of its subtrees represent tree structure represents a JSON document, then the a JSONdocumentthemselves. Infact,the fivepossible children of each node must represent the documents subtrees of the tree above correspondto the five JSON nested within it. For instance, consider the following values present in the JSON J. JSON document J. Atomic values. Finally,someelementsofaJSONdocu- { ment are actual values, such as integers or strings. For "name": { this reason leaf nodes corresponding to integers and "first": "John", "last": "Doe" strings will also be assigned a value they carry. Leaf }, nodes withouta valuerepresentempty objects: thatis, "age": 32 documents of the form {}. } Althoughthismodelissimpleandconceptuallyclear, as explained before, this document is a JSON object we are missing a way of representing arrays. Indeed, which contains two keys: "name" and "age". Further- consider again the document from Figure 1 (call this more, the value of the key "name" is another JSON document J2). In J2 the value of the key "hobbies" document and the value of the key "age" is the inte- is an array: another feature explicitly present in JSON ger 32. There are in total 5 JSON values inside this that thus needs to be reflected in our model. object: the complete document itself, plus the liter- Asarraysareordered,thismightsuggestthatwecan als 32, "John" and "Doe", and the object "name": have some nodes whose children form an ordered list {"first":"John", "last":"Doe"}. So how should a of siblings, much like in the case of XML. But this tree representation of the document J look like? If we would not be conceptually correct, for the following are to preserve the compositional structure of JSON, two reasons. First, as we have explained, JSON nav- then the most natural representation is by using the igation instructions use random access to access ele- following edge-labelled tree: ments in arrays. For example, the navigation instruc- tion used to retrieve an element of an array is of the formJ2rhobbiessris,aimedatobtainingthei-thelement "name" "age" ofthearrayunderthekey"hobbies". Butmoreimpor- tantly, we do not want to treat arrays as a list because lists naturally suggest navigating through different ele- 32 ments of the list. On the contrary,none of the systems "first" "last" we reviewed feature a way of navigating form one el- ement of the array to another element. That is, once we retrieve the first element of the arrayunder the key "John" "Doe" "hobbies",we have no way of linking it to its siblings. WechoosetomodelJSONarraysasnodeswhosechil- importantwhenmodellingschemadefinitionsforJSON, drenareaccessedbyaxeslabelledwithnaturalnumbers as we shall show later. reflecting their position in the array. Namely, in the Throughout this paper we will use the term JSON case of JSON document J2 above we obtain the follow- tree and JSON interchangeably. As already mentioned ing representation: above,oneimportantfeatureofourmodelisthatwhen lookingatanynodeofthetree,asubtreerootedatthis nodeisagainavalidJSON.Wecanthereforedefine,for "name" "hobbies" a JSON tree J and a node n in J, a function jsonpnq "age" which returns the subtree of J rooted at n. Since this subtree is again a JSON tree, the value of jsonpnq is 32 always a valid JSON. "first" "last" 1 2 3.2 JSON andXML Before continuing we give a few remarks about dif- "John" "Doe" "fishing" "yoga" ferences and similarities between JSON and XML, and howarethesereflectedintheirunderlyingdatamodels. Having arrays defined in this way allows us still to We start by summarising the differences between the treat the child edges of our tree as navigational axes: two formats. beforeweusedakeysuchas"age"totraverseanedge, 1. JSON mixes ordered and unordered data. JSON and now we use the number labelling the edge to tra- objects are completely without order, but for ar- verse it and arrive at the child. rays we can do random access depending on their Formal definition. Asourmodelisatree,wewilluse position.Ontheotherhand,XMLenforcesastrict treedomainsasitsbase. Atreedomainisaprefix-closed order between the children of each node. Cod- subsetofN˚. Withoutlossofgeneralityweassumethat ing JSONasXML wouldimply permitting sibling for all tree domains D, if D contains a node n¨i, for traversalforsomenodes,butdisallowingitforoth- nPN˚ then D contains all n¨j with 0ďj ăi. ers. We can do that with XML with some ad-hoc Let Σ be an alphabet. A JSON tree over Σ is a rules,butthisispreciselywhatwedoinourmodel structure J “ pD,Obj,Arr,Str,Int,A,O,valq, where D in a much cleaner way. is a tree domain that is partitioned by Obj, Arr, Str 2. JSONArraysareneitherlistsnorsets. Aswehave and Int, O ĎObjˆΣ˚ˆD is the object-childrelation, explained, we have random access, but we do not AĎArrˆNˆD is the array-childrelation, val:StrY have the possibility of sibling traversal. Enforcing IntÑΣ˚YN is the string and number value function, this in languagessuch as XPathis a verycumber- and where the following holds: some task. 1 For each node n P Obj and child n ¨ i of n, O 3. JSON trees are deterministic. The property of contains one triple pn,w,n¨iq, for a wordw PΣ˚. JSONtreewhichimposesthatallkeysofeachob- 2 The first two components of O form a key: if ject have to be distinct makes JSON trees deter- pn,w,n¨iq and pn,w,n¨jq are in O, then i“j. ministicinthesensethatifwehavethekeyname, there can be at most one node reachable through 3 For each node n P Arr and child n ¨ i of n, A anedgelabelledwiththiskey. Ontheotherhand, contains the triple pn,i,n¨iq. XML treesarenondeterministicsincethereareno 4 If n is in Str or Int then D cannot contain nodes labels on the edges, and a node can have multiple of form n¨u. children. As we will see, the deterministic nature 5 The value function assigns to each string node in of JSON can make some problems more difficult Str a value in Σ˚ and to each number node in Int than in the XML setting. a natural number, 4. Value is not just in the node, but is the entire sub- treerooted at that node. Anotherfundamentaldif- The usage of a tree domain is standard, and we have ference is that in XML when we talk aboutvalues electedtoexplicitlypartitionthedomainintofourtypes we normally refer to the value of an attribute in of nodes: Obj for objects, Arr for arrays,Str for strings a node. On the contrary, it is common for sys- and Int for integers. The first and second conditions tems using JSON to allow comparisons of the full specify that edges between objects and their children subtree of a node with a nested JSON document, are labelled with words,but we can only use eachlabel or even comparing two nodes themselves in terms one time per each node. The third condition specifies of their subtrees. To be fair, in XML one could that the edges between arrays and their children are also argue this to be true, but unlike in the case labelledwiththe numberrepresentingthe orderofchil- ofXML,these“structural”comparisonsareintrin- dren. The fourth condition simply states that strings sic in most JSON languages, as we discuss in the and numbers must be leaves in our trees, and the fifth following sections. conditiondescribesthevaluefunctionval. Notethatwe haveexplicitlydistinguishedthefourtypeofJSONdoc- On the other hand, it is certainly possible to code uments (objects, arrays, strings and integers). This is JSON documents using the XML data format. In fact, themodeloforderedunrankedtreeswithlabelsandat- MongoDB’s find function. The basic querying tributes, which serves as the base of XML, was shown mechanism of MongoDB is given by the find function to be powerful enough to code some very expressive [23], therefore we focus on this aspect of the system2. database formats, such as relational and even graph The find function receives two parameters, which are data. However, both models have enough differences both JSON documents. Given a collection of JSON to justify a study of JSON on its own. This is par- document and these parameters,the find function then ticularly evident when considering navigation through produces an array of JSON documents. JSONdocuments,wherekeysineachobjecthaveto be The first parameter of the find function serves as a unique,thusallowingustoobtainvaluesveryefficiently. filter, and its goal is to select some of the JSON doc- Ontheotherhand,codingJSONasXMLwouldrequire uments from the input. The second parameter is the ustohavekeysasnodelabels,thusforcingascanofall projection, and as its name suggests, is used to specify of the node’s children in order to retrieve the value. whichpartsofthefiltereddocumentsaretobereturned. Since ourgoalis specifyinga navigationallogic,we will 4. NAVIGATIONALQUERIESOVERJSON only focus on the filter parameter, and on find queries thatonlyspecify thefilter. We returntothe projection As JSON navigation instructions are too basic to inSection6. Formoredetailswereferthereadertothe serve as a complete query language,most systems have current versionof the documentation [23]. developeddifferent ways ofquerying JSONdocuments. The basic building block of filters are what we call Unfortunately, there is no standard, nor general guide- navigation condition,whichcanbevisualisedasexpres- lines, about how documents are accessed. As a re- sionsofthe formP „J,whereP is aJSONnavigation sult the syntax and operations between systems vary instruction, „ is a comparison operator (MongoDB al- so much that it would be almost impossible to com- lows all the usual ă, ď, “ , ě, ą, and several others pare them. Hence, it would be desirable to identify operators) and J is a JSON document. a common core of functionalities shared between these systems,oratleastageneralpictureofhowsuchquery Example 1. Assume that we are dealing with languages look like. Therefore we begin this section by a collection of JSON files containing informa- reviewingthemostcommonoperationsavailableincur- tion about people and that we want to obtain the rent JSON systems. one describing a person named Sue. In Mon- Here we mainly focus on the subdocument selecting goDB this can be achieved using the following query functionalities of JSON query languages. By subdocu- db.collection.find({name: {$eq: "Sue"}},{}). mentselectingwemeanfunctionalitiesthatarecapable The initial part db.collection is a system path to of finding or highlighting specific parts within JSON find the collection of JSON documents we want to documents, either to be returned immediately or to be query. Next, "name" is a simple navigation instruction combinedas new JSONdocuments. As our workis not used to retrieve the value under the key "name". intendedtobe asurvey,wehavenotreviewedallpossi- Last, the expression {$eq: "Sue"} is used to state blesystemsavailabletodate. However,wetakeinspira- that the JSON document retrieved by the navigation tion from MongoDB’s query language (which arguably instruction is equal to the JSON "Sue". Since we are hasservedasabasisformanyothersystemsaswell,see not dealing with projection, the second parameter is e.g. [2, 27, 32]), and JSONPath [15]and SQL++ [25], simply the empty document {}. Using the notation two other query languages that have been proposed by above we could also write this navigation condition as the community. Jrnames“"Sue". Based on this, we propose a navigational logic that can serve as a common core to define a standard way Finally, navigationconditions can be combinedusing ofqueryingJSON.We then define severalextensionsof boolean operations with the standard meaning. Also thislogic,suchasallowingnondeterminismorrecursion, note that filters always return entire documents. If we andstudy how theseaffectbasic reasoningtasksuchas want a part of a JSON file we need to use filters. evaluation and containment. QuerylanguagesinspiredbyXPath orrelational expressions. The languages we analysed thus far of- 4.1 Accessing documents inJSONdatabases fer very simple navigational features. However, people Here we briefly describe how JSON systems query alsorecognizedtheneedtoallowmorecomplexproper- documents. ties such as nondetermnistic navigation, expression fil- QuerylanguagesinspiredbyFLWRorrelational ters and allowing arbitrary depth nesting through re- expressions. There are several proposals to construct cursion. As a result, an adaptation of the XML query query languagesthatcanmerge,joinandevenproduce language XPath to the context of JSON, called JSON- new JSON documents. Most of them are inspired ei- Path [15] was introduced and implemented (see e.g. ther by XQuery (such as JSONiq [30]) or SQL (such https://github.com/jayway/JsonPath). as SQL++ [25]). These languages have of course a lot Based on these features, we first introduce a logic of intricate features, and to our best extent have not capturing basic queries providedby navigationinstruc- beenformallystudied. However,intermsofJSONnav- 2For a detailed study of other functionalities MongoDB of- igation thy all seem to support basic JSON navigation fers see e.g. [8]. Note that this work does not consider the instructions and not much more. findfunction though. tions and conditions, and then extend it with non- ‚ JJK “D. J determinism and recursion resulting in a logic resem- bling similar approaches over XML. ‚ J ϕKJ “D´JϕKJ. 4.2 Deterministic JSONlogic ‚ Jϕ^ψK “JϕK XJψK . J J J ThefirstlogicweintroduceismeanttocaptureJSON navigationinstructionsandotherdeterministicformsof ‚ Jϕ_ψKJ “JϕKJ YJψKJ. queryingsuchasMongoDB’sfindfunction. Wecallthis ‚ JrαsK “ tn | n P D and there is a node n1 in D logic JSON navigation logic, or JNL for short. We be- J such that pn,n1qPJαK u lieve that this logic, although not very powerful, is in- J terestinginits ownright,asitleadstoverylightweight algorithms and implementations, which is one of the ‚ JEQpα,AqKJ “ tn | n P D and there is a node n1 aims of the JSON data format. in T such that pn,n1qPJαKJ and jsonpn1q“Au As often done in XML[13] and graph data[21], we ‚ JEQpα,βqK “ tn | n P D and there are nodes define ours in terms of unary and binary formulas. J n1,n2inT suchthatpn,n1qPJαKJ,pn,n2qPJβKJ, Definition 1 (JSON navigational logic). and jsonpn1q“jsonpn2qu. Unary formulas ϕ,ψ and binary formulas α,β of the Typically,mostsystemsallowjumpingtothelastele- JSON navigational logic are expressions satisfying the mentofanarray,orthethe j-thelementcountingfrom grammar the last to the first. To simulate this we can allow bi- α,β :“ xϕy | X | X | α˝β | ε nary expressions of the form X , for an integer i ă 0, w i i ϕ,ψ :“ J | ϕ | ϕ^ψ | ϕ_ψ | rαs | where ´1 states the last position of the array, and ´j states the j-th position starting from the last to the EQpα,Aq | EQpα,βq first. Having this dual operator would not change any where w is a word in Σ˚, i is a natural number and A of our results, but we prefer to leave it out for the sake is an arbitrary JSON document. of readability. Algorithmic properties of JNL. As promised, here Intuitively,binaryoperatorsallowustomovethrough weshowthatJNLisalogicparticularlywellbehavedfor thedocument(theyconnecttwonodesofaJSONtree), databaseapplications. Forthis westudythe evaluation andunaryformulascheckwhether apropertyis trueat problemandsatisfiabilityproblemassociatedwithJNL. some point of our tree. For instance, X and X allow w i The Evaluation problem asks, on input a JSON J, a basic navigation by accessing the the value of the key JNLunaryexpressionϕandanodenofJ,whethernis named w, or the ith element of an array respectively. in JϕK . The Satisfiability problem asks, on input a They can subsequently be combined using composition J JNLexpressionϕ,whetherthereexistsaJSONJ such orbooleanoperationstoformmorecomplexnavigation thatJϕK isnonempty. Westartwithevaluation,show- expressions. Unaryformulasserveastestsifsomeprop- J ing that JNL indeed matches the“lightweight”spirit of ertyholds atthe partofthe documentwe arecurrently the JSONformat and canbe evaluatedvery efficiently: reading. These also include the operator rαs allowing us to test if some binary condition is true starting at a Proposition 1. The Evaluationproblem for JNL current node (similarly, xϕy allows us to combine node can be solved in time Op|J|¨|ϕ|q. tests with navigation). Finally, the comparison opera- tors EQpα,Aq and EQpα,βq simulate XPathstyle tests For this result, we can reuse techniques for XPath which check whether a current node can reach a node whose value is A, or if two paths can reach nodes with evaluation (see e.g. [28, 16]). However, the presence of the EQpα,βq operator forces us to refine these tech- the same value. The difference from XML though, is niques in a non-trivial way. A straightforward way of that this value is again a JSON document and thus a incorporating this predicate into XPath algorithms is subtree of the original tree. to pre-processallpairsofnodes tosee whichpairshave The semantics of binary formulas is given by the re- lation JαK , for a binary formula α and a JSON J, and equal subtrees, but this only gives us a quadratic algo- J it selects pairs of nodes of J: rithm. Instead, we transform our JNL formula into an equivalentnonrecursivemonadicdatalogprogramwith ‚ JxϕyKJ “tpn,nq|nPJϕKJu. stratified negation [17], and show how to evaluate the latter by doing equality comparisons “online” as they ‚ JX K “tpn,n1q|pn,w,n1qPOu. w J appear. ‚ JX K “tpn,n1q|pn,i,n1qPAu, for iPN. Next,wemovetosatisfiability,showingthatthecom- i J plexityoftheproblemis bestpossible,consideringthat ‚ Jα˝βKJ “JαKJ ˝JβKJ. JNL can emulate propositionalformulas. ‚ JεK “tpn,nq|n is a node in Ju. J Proposition 2. The Satisfiability problem for For the semantic of the unary operators, let us assume JNL is NP-complete. It is NP-hard even for formu- that D is the domain of J. las not using negation nor the equality operator. It might be somewhat surprising that the positive Proposition 4. The Satisfiabilityproblemisun- fragmentwithoutdatacomparisonsisnottriviallysatis- decidable fornon-deterministicrecursiveJNLformulas, fiable. Thisholdsduetothefactthateachkeyinanob- even if they do not use negation. jectisunique,soaformulaoftheformXarX1s^XarXbs is unsatisfiable because it forces the value of the key a However, if we rule out the equality operator we to be both an array and a string at the same time. can show much better bounds. For the full non- deterministic, recursive JNL (without equalities) the 4.3 Extensions satisfiability problem is the same as other similar Although the base proposalfor JNL captures the de- fragments such as PDL. For (non-recursive) non- terministic spirit of JSON, it is somewhat limited in deterministic JNL the problem is slightly easier. expressive power. Here we propose two natural exten- sions: the ability to non-deterministically select which Proposition 5. The Satisfiability problem is: child of a node is selected, and the ability to traverse ‚ Pspace-complete for non-deterministic, non- paths of arbitrary length. recursive JNL without the EQpα,βq operator. Non-determinism. The path operators X and X w i canbeeasilyextendedsuchthattheyreturnmorethan ‚ Exptime-complete for non-deterministic, recur- a single child; namely, we can permit matching of reg- sive JNL without the EQpα,βq operator. ular expressions and intervals, instead of simple words and array positions. Note that Pspace-hardness for satisfiability follows Formally, non-deterministic JSON logic extends bi- easily from the fact that we are now allow regular ex- nary formulas of JNL by the following grammar: pressions in our edges: Given a regular expression e, we have that the e is universal if and only if the query α,β :“ xϕy | Xe | Xi:j | α˝β | ε rXΣ˚s^ rXes is not satisfiable. However,in the proof whereeisasubsetofΣ˚(givenasaregularexpression), of this proposition we in fact show that the problem and i ď j are natural numbers, or j “ `8 (signifying remains Pspace-hard even when the only regular ex- thatwewantanyelementofthearrayfollowingi). The pression which is not a word in a X axis is Σ˚. One e semantics of the new path operators is as follows: canalsoshowthat Pspace-hardnessremainswhenone ‚ JX K “ tpn,n1q | there is w P Lpeq such that only considers JSON documents without object values. e J pn,w,n1qPOu. 5. SCHEMA DEFINITIONSFORJSON ‚ JXi:jKJ “ tpn,n1q | there is i ď p ď j such that the triple pn,p,n1q is in Au. Having dealt with navigational primitives for query- ing JSON, our next task is to analyse JSON Schema Recursion. In order to allow exploring paths of arbi- definitions. We focus solely on the JSON Schema spec- trary length we add the Kleene star to our logic. That ification [18], which is, up to our best knowledge, the is, recursive JNL allows pαq˚ as a binary formula (as only attempt to define a general schema language for usual we normally omit the brackets when the prece- JSON documents. The JSON Schema specification is dence of operators is clear). The semantics of pαq˚ is currently in its fourth draft, and on its way of becom- given by ing an IETF standard. Jpαq˚K“JεKJ YJαKJ YJα˝αKYJα˝α˝αKJ Y.... 5.1 JSON Schema So what happens to the evaluation and containment As before, we first briefly present how JSON Schema when we extend this logic? For the case of evaluation, works. We remark again that our intention is not to we caneasilyshowthatthe linearalgorithmis retained provide a full analysis for the specification, but rather as long as we do not have the binary equality opera- showhowthenavigationworks,withtheaimofobtain- tor EQpα,βq. Indeed, in this case, the evaluation can ing a logic that can capture JSON Schema. We thus be done using the classical PDL model checking algo- concentrateonacorefragmentthatisequivalenttothe rithm [1, 10] with small extensions which account for full specification;we refer to [29] formore details anda the specifics of the JSON format. However, we are not full formalisation of this core. able to extend the linear algorithm for the full case, EveryJSONschema is JSON document itself. JSON because an expression of the form EQpα,βq might re- Schema can specify that a document must be any of quirecheckingallpairsofnodesinourtreeforequality, the different types of values (objects, arrays,strings or resulting in a jump in complexity. numbers); and for each of these types there are several keywords that help shaping and restricting the set of Proposition 3. The evaluation problem for JNL documentsthataschemaspecifies. Themostimportant with non-determinism and recursion can be solved in time Op|J|3¨|ϕ|q, and in time Op|J|¨|ϕ|q if the formula keywordisthe”type”keyword,asitdeterminesthetype of value that has to be validated against the schema: a does not use the predicate EQpα,βq. documentoftheformt"type":"string", ...uspecifies Forsatisfiabilitythesituationisradicallydifferent,as stringvalues,t"type":"number", ...uspecifiesnumber the combinationof recursion,non-determinismand the values, t"type":"object", ...u specifies objects and binary equalities ends up being too difficult to handle. t"type":"array", ...u specifies arrays. In addition to Keywords for string schemas: Keywords for object schemas: - "type":"string" - "pattern": exp - "type":"object" - "required":r k1,...,kns - "minProperties": i - "maxProperties": i Keywords for number schemas: - "properties":tk1 :J1,...,km :Jmu - "type":"number" -"multipleOf":i - "patternProperties":t"e1":J1,...,"eℓ":Jℓu - "minimum":i - "maximum":i - "additionalProperties":J Keywords for array schemas: Boolean combination and comparisons: - "items":rJ1,...,Jns - "anyOf":r J1,...,Jns - "allOf":r J1,...,Jms - "uniqueItems":true - "additionalItems":J - "not":J - "enum": r A1,...,Ans Table 1: The form for all keyords in JSON schema. Here i is always a natural number, J and each J are JSON i schemas, A1,...,An are JSON documents, each ki is a string value (k stand for key); and exp and each expi are regular expressions over the alphabet Σ of strings. the type keyword, each schema includes a number of neither in properties nor conform to the language of other pairs that shape the documents they describe. anexpressioninpatternProperties. Forexample,the We now describe each of the four types of basic followingschemaspecifiesobjectswherethevalueunder schemas. Table 1 contains a list of all keywords avail- ”name”mustbeastring,thevalueunderanykeyofthe able for each of these schemas. form a(b|c)a must be an even number, and the value under any key which is neither ”name”nor conforms to String schemas. String schemas are those featuring the expression above must always be the number 1. the "type":"string" pair. Additionally, they may { include the pair "pattern":"regexp", for regexp a "type": "object", regular expression over Σ, which validates only against "properties": { those strings that belong to the language of this "name": {"type":"string"}, expression. For example, t"type":"string"u and }, t"type":"string","pattern":"p01q`"u are string "patternProperties: { "a(b|c)a": {"type":"number", "multipleOf": 2} schemas. The firstschema validates againstany string, }, and the second only against strings built from 0 or 1. "additionalProperties: { "type": "number", Number schemas. For numbers, we can use the "minimum": 1 pair "minimum":i to specify that the number is at "maximum": 1 least i, "maximum":i to specify that the number is } at most i, and "multipleOf":i to specify that a } number must be a multiple of i. Thus for example Array schemas. Arrayschemasarespecifiedwiththe t"type":"number","maximum":12,"multipleOf":4u "type": "array" keyword. For arrays there are two describes numbers 0, 4, 8 and 12. ways of specifying what kind of documents we find in arrays. Wecanuseapair"items":rJ1,...,Jnstospec- Object schemas. Besides the "type":"object"pair, ifyadocumentwithanarrayofnelements,whereeach object schemas may additionally have the following: i-th element must satisfy schema J . We can also use i - Pairs "minProperties":i and "maxProperties":j, "additionalItems":J to specify that all elements in to specify that an object has to have at least i and/or the array must satisfy schema J. If both keywords are at most j key-value pairs. usedtogether,thenweallowthearraytohavemoreval- uesthanthosespecifiedwithitems,aslongastheyagree - a pair "required":rk1,...,kns, where each ki is a withtheschemaspecifiedinadditionalItems. Finally, string value. This keywordmandates that the specified one can include the pair "uniqueItems":true to force object values must contain pairs with keys k1,...,kn. arrays whose elements are all distinct form each other. - a pair "properties":tk1 : J1,...,km : Jmu, where Forexample,the followingschemavalidatesagainstar- each k a string value and each J is itself a JSON i i raysofatleast2elements,wherethefirsttwoarestrings Schema. This keyword states that the value of each and the remaining ones, if they exists, are numbers. pair with key k must validate against schema J . i i -Apair"patternProperties":t"e1":J1,...,"eℓ":Jℓu, { where each e is a regular expression over Σ and each "type": "array", i "items": [{"type":"string"}, {"type":"string"}], J is a JSON schema. This keyword works just like i "additionalItems": {"type":"number"}, properties, but now any value under any key that "uniqueItems":true conforms to the expressionexpi must satisfy Ji. } -finally,thepair"additionalProperties":J,whereJ isaJSONschema. Thiskeywordpresentsaschemathat Boolean combinations. The last feature in JSON mustbesatisfiedbyallvalueswhosekeysdonotappear Schema are boolean combinations. These allow us to specify that a document must validate against two we want any element of the array following i)and A is schemas, against at least one schema, or that it must an arbitrary JSON document. notvalidateagainstaschema. Forexample,theschema "not":{"type":"number","multipleOf":2}validates As with JSON navigational logic, we can also obtain againstany oddnumber,orany documentwhichis not a deterministic versionof JSL by restricting the syntax ✸ a number. t✸o use only modal operators lw and li, and w and ; for a word wPΣ˚ and a natural number i. i 5.2 JSON Schema Logic The semantics is given by extending the relation |ù. In order to capture the JSON Schema specification - pJ,nq|ùJ for every node n in J. with a logical formalism, we isolate navigation and - pJ,nq|ù ϕ iff. pJ,nq*ϕ. atomic tests into two different sets of operators. Let us start with atomic operations, which are basically - pJ,nq|ùϕ^ψ iff. pJ,nq|ùϕ and pJ,nq|ùψ. a rewriting of most JSON Schema keywords into our - pJ,nq|ùϕ_ψ iff. either pJ,nq|ùϕ or pJ,nq|ùψ. framework. We allocate them in the set NodeTests. ✸ -pJ,nq|ù ϕiff. thereisawordw PLpeqandanode Formally,NodeTestscontainsthepredicatesArr,Obj, e n1 in J such that pn,w,n1qPO and pJ,n1q|ùϕ Str,IntandUnique,plusapredicatePatternpeqforeach ✸ regularexpressionebuiltfromΣ,predicatesMinpiqand - pJ,nq |ù i:j ϕ iff. there is i ď p ď j and a node n1 Maxpiqforeachintegeri,apredicateMultOfpiqforeach in J such that pn,p,n1qPA and pJ,n1q|ùϕ iě0,predicatesMinChpkqandMaxChpkqforeachk ě - pJ,nq |ù l ϕ iff. pJ,n1q |ù ϕ for all nodes n1 such e 0andapredicate„pAqforeachJSONdocumentA. The that pn,w,n1qPO for some wPLpeq. semantics of these predicates is given by the relation |ù, that states whether an atomic predicate holds for a - pJ,nq |ù li:j ϕ iff. pJ,n1q |ù ϕ for all nodes n1 such that pn,p,n1qPA for some iďpďj. given node n of a JSON J. In order to present our results regarding JSL and -pJ,nq|ùArriff. nPArr. -pJ,nq|ùObjiff. nPObj. JSON Schema, we abuse notation and write J |ù ψ - pJ,nq|ùStr iff. nPStr. - pJ,nq|ùInt iff. nPInt. whenever pJ,rq|ùψ, where r is the root of J. - pJ,nq|ùPatternpeq iff. valpnq is a string in Lpeq. Expressive power. As promised we show that JSL - pJ,nq|ùMinpiq iff. valpnq is a number greater than i. can capture JSON schema. In order to present this - pJ,nq|ùMaxpiq iff. valpnq is a number smaller thani. result we informally speak of the validation relation of JSONSchema,andsaythataJSONS validatesagainst - pJ,nq|ùMultOfpiq iff. valpnq is a multiple of i. J wheneverJ isinaccordancetoallkeywordspresentin - pJ,nq|ùMinChpiq iff. n has at least i children. S. Wereferto[29]formoredetailsonthesemantics. As - pJ,nq|ùMaxChpiq iff. n has at most i children. usual,wesaythatJSONSchemaandJSLareequivalent in expressive power if for any JSON Schema S there - pJ,nq |ù Unique iff. n P Arr and all of its chil- exists a JSL formula ψ PL such that for every JSON dren are different JSON values: for each n1 ‰ n2 such S that pn,p,n1q and pn,q,n2q belong to A we have that document J we have that J validates against S if and only if J |ù ψ ; and conversely, for any expression ϕ P jsonpn1q‰jsonpn2q. S L there exists a JSON Schema S such that for every ϕ - pJ,nq|ù „pAq iff. jsonpnq“A. JSON document J we have that J |ùψ if and only if J WithNodeTestswecoverallatomicfeaturesofJSON validates against S. Schema. All that remains is the navigation, which in JSON Schema is given by the keywords properties, Theorem 1. JSL and JSON Schema are equivalent patternProperties, additionalProperties and required, in expressive power. for objects, and items and additionalItems for arrays. These forms of navigation are suggest using existen- ComparingJSLandJNL.Next,weconsiderhowthe tial and universal modalities. For instance, the key- navigation logic of Section 4 compares to the schema wordpatternPropertiesspecifies aschemathatmust logic JSL. Even though the starting point of the two be validated by all values whose keys satisfy a regular logics is different, we next show that the two logics are expression, and "required": rws demands that there essentially the same, their expressivity differing simply must exist a children with key w. Thus, to finish our because of the different atomic predicates. More pre- logic, we augment our node tests with universal and cisely, we have: existential modalities, as well as booleancombinations. Theorem 2. Non-deterministic JNL not using the Definition 2. Formulas in the JSON schema logic equalityEQpα,βqandnon-deterministicJSLusingonly (JSL) are expressions satisfying the grammar thenodetest„pAqareequivalentintermsofexpressive power. More precisely: ϕ,ψ :“ J | ϕ | ϕ^ψ | ϕ_ψ | ψ PNodeTests | leϕ | li:jϕ | ✸eϕ | ✸i:jϕ ‚ For every formula ϕS in JSL there exists a unary formula ϕN in JNL such that for every JSON J: whereeisasubsetofΣ˚ (givenasaregularexpression), i ď j are natural numbers, or j “ `8 (signifying that JϕNKJ “tnPJ |pJ,nq|ùϕSu. ‚ For every unary formula ϕN in JNL there exists a { ϕS in JSL such that for every JSON J: "definitions": { "email": { JϕNK “tnPJ |pJ,nq|ùϕSu. "type": "string", J "pattern": "[A-z]*@ciws.cl" Intheproofabove,wealsoshowthatgoingfromJSL } }, to JNL takes only polynomial time, while the transi- "not": {"$ref": "#/definitions/email"} tion in the other direction can be exponential. This } implies that the upper bounds for JNL are valid for As we have mentioned, different schemas are de- JSL, while the lower bounds transfer in the opposite fined under the definitions section of the JSON direction (without taking into account node tests). document, and these definitions can be reused us- Algorithmic Properties. SinceJSLisdesignedtobe ing the $ref keyword. In the example above, we use {"$ref": "#/definitions/email"} to retrieve a schema logic to validate trees, we specify a boolean the schema {"type": "string", "pattern": "[A- Evaluation problem: the input is a JSON J and a z]*@ciws.cl"}. Thesedefinitionscanbenestedwithin JSL expressionϕ, and we decide whether J |ùϕ. eachother,butsemanticsiscurrentlydefinedonlyfora Proposition 6. The Evaluation problem for JSL fragment with limited recursion. We come back to this can be solved in time Op|J|2 ¨|ϕ|q, and in Op|J|¨|ϕ|q issue after we define a logic capturing these schemas. when ϕ does not use the uniqueItems keyword. Recursive JSL.Theideaofthislogicistocapturethe FromTheorem1weobtainasacorollarythattheval- recursivefunctionalitiespresentinJSONSchema: there idationproblemforJSONSchemahasthesamebounds. is a special section where one can define new formulas, This was already shown in [29]. Next, we study the which can be then re-used in other formulas. Satisfiability problem, which receives a formula ϕ Fix an infinite set Γ“ tγ1,γ2,γ3,...u of symbols. A as input and consists of deciding whether there is any recursive JSL formula is an expression of the form JSON tree J such that J |ù ϕ. Here we need to be γ1 “ ϕ1 careful with the encoding we choose, as the interplay ✸ γ2 “ ϕ2 between Unique and J immediately forces a node i . with exponentially many different children when i is . . giveninbinary. Intermsofresults,thismeansthatour γ “ ϕ algorithms raise by one exponential, although we don’t m m know if this increase is actually unavoidable. ψ (1) Proposition 7. The Satisfiability problem for whereeachγi isasymbolinΓandϕi,ψareJSLformu- JSL is in Expspace, and Pspace-complete for expres- las over the extended syntax that allows γ1,...,γm as sions without Unique. atomic predicates. Here each equality γi “ ϕi is called a definition, and ψ is called the base expression. We remark that the Satisfiability problem is im- The intuition is that each γ is one of the references i portant in the context of JSON Schema. For example, of JSON Schema. Before moving to the semantics, let the community has repeatedly stated the need for al- us show the way recursive JSL formulas work. gorithmsthatcanlearnJSONSchemasfromexamples. We believe that understanding basic tasks such as sat- Example 2. Consider the expression ∆ given by isfiabilityarethefirststepstoproceedinthisdirection. γ1 “lΣ˚γ2 ✸ 5.3 Adding recursion γ2 “p Σ˚Jq^plΣ˚γ1q γ1 The JSON Schema specification also allows defining statements that will be reused later on. We have so far Intuitively, γ1 holds in a node n if n either has no chil- ignoredthisfunctionality,andtocaptureitwewillneed dren, or if γ2 holds all of its its children. On the other todefine thesameoperatorinourlogic. Aswewillsee, hand, γ2 holds in a node n if this node has at least one this lifts the expressive power of JSL away from even child, and γ1 holds in all children of n. Finally, the the recursive version of our navigational logic; and is base expression simply states that γ1 has to hold in the very similar to certain forms of tree automata. root of the document (recall that a schema statement is LetusexplainfirsthowrecursionisaddedintoJSON evaluatedattheroot). Theintuitionfor ∆is, then, that Schema. The idea is to allowto anadditionalkeyword, it should hold on every treesuchthat each pathfrom the of the form {$ref: <path>}, where <path> is a navi- root to the leaves is of even length. gation instruction. This instruction is used within the Well-formed recursive JSL. As usual in formalisms same document to fetch other schemas that have been thatmixrecursionandnegation,givingaformalseman- predefined in a reserveddefinitionssection3. For ex- tics forJSONSchemais nota straightforwardtask. As ample,thefollowingschemavalidatesagainstanyJSON an example of the problems we face, consider the fol- which is not a string following the specified pattern. lowing JSL expression. 3Definitionsandreferencescanalsobeusedtofetchschemas in different documents or even domains; here we just focus γ1 “ γ1 on therecursivefunctionality. γ1

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.