Sánchez-de-Madariagaetal.BMCMedicalInformatics andDecisionMaking (2017) 17:123 DOI10.1186/s12911-017-0515-4 RESEARCH ARTICLE Open Access Examining database persistence of ISO/EN 13606 standardized electronic health record extracts: relational vs. NoSQL approaches Ricardo Sánchez-de-Madariaga1* , Adolfo Muñoz1, Raimundo Lozano-Rubí2,3, Pablo Serrano-Balazote4, Antonio L. Castro1, Oscar Moreno1 and Mario Pascual1 Abstract Background: Theobjectiveofthisresearchistocomparetherelationalandnon-relational(NoSQL)databasesystems approachesinordertostore,recover,queryandpersiststandardizedmedicalinformationintheformofISO/EN13606 normalizedElectronicHealthRecordXMLextracts,bothinisolationandconcurrently.NoSQLdatabasesystemshave recently attracted much attention, but few studies in the literature address their direct comparison with relational databases when applied to build the persistence layer of a standardized medical information system. Methods: OnerelationalandtwoNoSQLdatabases(onedocument-basedandonenativeXMLdatabase)ofthree differentsizeshavebeencreatedinordertoevaluateandcomparetheresponsetimes(algorithmiccomplexity)ofsix differentcomplexitygrowingqueries,whichhavebeenperformedonthem.Similarappropriateresultsavailableinthe literaturehavealsobeenconsidered. Results:Relational and non-relational NoSQL database systems show almost linear algorithmic complexity query execution. However, they show very different linear slopes, the former being much steeper than the two latter. Document-basedNoSQLdatabasesperformbetterinconcurrencythaninisolation,andalsobetterthanrelational databasesinconcurrency. Conclusion:Non-relationalNoSQLdatabasesseemtobemoreappropriatethanstandardrelationalSQLdatabases whendatabasesizeisextremelyhigh(secondaryuse,researchapplications).Document-basedNoSQLdatabases perform in general better than native XML NoSQL databases. EHR extracts visualization and edition are also document-based tasks more appropriate to NoSQL database systems. However, the appropriate database solution much depends on each particular situation and specific problem. Keywords: Relational database, NoSQL database, Normalized medical information, ISO/EN 13606 standard, Electronic health record extract, Algorithmic complexity, Primary use, Clinical practice, Secondary research use, Document-based task *Correspondence:[email protected] 1TelemedicineandInformationSocietyDepartment,HealthInstitute“Carlos III”(ISCIII),c/SinesioDelgado,4–,28029Madrid,Spain Fulllistofauthorinformationisavailableattheendofthearticle ©TheAuthor(s).2017OpenAccessThisarticleisdistributedunderthetermsoftheCreativeCommonsAttribution4.0 InternationalLicense(http://creativecommons.org/licenses/by/4.0/),whichpermitsunrestricteduse,distribution,and reproductioninanymedium,providedyougiveappropriatecredittotheoriginalauthor(s)andthesource,providealinkto theCreativeCommonslicense,andindicateifchangesweremade.TheCreativeCommonsPublicDomainDedicationwaiver (http://creativecommons.org/publicdomain/zero/1.0/)appliestothedatamadeavailableinthisarticle,unlessotherwisestated. Sánchez-de-Madariagaetal.BMCMedicalInformaticsandDecisionMaking (2017) 17:123 Page2of14 Background the relational model have been proposed and used. In this Electronic Health Record (EHR) Knowledge Management paperwereviseNode+Path[15],developedbyopenEHR, Systems (KMS) or EHR systems (for short) form an essen- andtheArchetypeRelationalMapping(ARM)[16]. tialpartofmedicalcaretoday.However,creating,maintain- The Object Relational Mapping (ORM) exhaustively ing and communicating EHR documents in those systems maps the structure of a standardized EHR extract XML is not at all straightforward. This is due to several factors (eXtensible Mark-up Language) file to a relational data- affecting technical, syntactic and semantic interoperability base[13,17].AnEHRextractisdefinedasaunitofcom- betweeninformationsystems,includingtheinevitablerapid municationofallorpartofanEHRdocumentandisalso change and evolution of medical knowledge. In order to an instance of the ISO/EN 13606 Reference Model (RM) achievesuchgoals,EHRsystemsanddocumentshavebeen [18].ORMimpliestheconstructionofmanytablesrelated normalized in several international standards such as ISO/ through foreign keys representing the complex structure EN13606,openEHRandHL7[1–4]. oftheextractXMLfileandmaydamageperformance. The ISO/EN 13606 and openEHR standards define a dual model that separates information and knowledge Relationalimprovements into two levels of abstraction, thereby guaranteeing openEHR [3] promotes a Node + Path persistence solu- semantic interoperability between systems operating EHR tion that serialises subtrees of the whole extract XML documents[5]. file into BLOBs (binary latge objects) in a few relational Standardized EHR documents constitute information tables, taking advantage of the semantic paths of the files that need to be maintained and stored physically in normalized EHR extracts. This is a simple and flexible those systems. The special nature of medical knowledge solution, but its simplicity causes complex data retrieval that requires the separation into two levels of the dual logic,therebydamaging complex queries [15]. modelcan haveaprofound effecton thewayinformation Archetype Relational Mapping (ARM) [16] is another in EHR documents is structured and how it is stored interesting relational improvement. Node + Path uses a logicallyandphysicallyinadatabasemanagementsystem. general data storage structure that is independent of The dual model used by standardized EHR documents archetypes. Another approach is to generate a database requires the organization of the information following a model to design a persistence layer driven by archetypes. specificstructure,andmedicalknowledgemustalsoadopt This solution builds a new relational schema based on the structure constrained by the archetypes, i.e. special mappings between the dual model archetypes and rela- datastructuresholdingknowledge[1,2,6–8]. tionaltables. Standardized EHRs are a form of big data, from which patterns may be extracted using Data Mining (DM) and Non-relationalNoSQLapproach Machine Learning (ML) techniques to generate new All previous persistence solutions have been based upon knowledge[9].StandardizedEHRextracts(seenextsubsec- an underlying relational database system. However the tion)maybeextractedautomaticallyfromnon-standardized relational paradigm was recently questioned by NoSQL EHR repositories using standard technologies based on (document-based) database systems.A NoSQL (NotOnly W3C(WorldWideWebConsortium)XMLSchemas[10]. SQL) (SQL, Structured Query Language) database pro- Cross-organizational EHR communication will consti- videsamechanism forstorageandretrieval ofdatawhich tute a key component of future health care [10]. Previous is modelled on means other than the tabular relations literature suggests that implementing a fully functioning used in relational databases [19, 20]. A document-based EHR system with participation of all healthcare organiza- NoSQL database system stores documents in any format tions could lead to a USD 77.8 billion benefit for the like XML [21] or JSON (JavaScript Object Notation) [22] UnitedStates[11,12]. as data [23]. NoSQL DBMSs do not substitute existing relationalDBMSs,butmaybeappropriateinspecificsitu- Relationalapproach ations.ManyNoSQLdatabasesstoredocumentsas entire For decades Database Management Systems (DBMS) have BLOBs. They have no schema and do not support either been dominated by the relational model paradigm [13]. joins or atomicity, consistency, isolation, or durability This model has a well-established theoretical background (ACID)properties[24].Sotheymaybeveryinefficientifa which has been well studied and understood, and haslong subpartofadocumentreferencespartsofothersuchdoc- guaranteed consistency and efficiency within database sys- uments through an indirection link, because the whole tems. However, the complex structure of the information referenced document(s) must be processed sequentially adoptedbythenormalizedEHRdocumentsmaycausethe [25].HoweverifthemaintaskcarriedoutbytheDBMSis direct application of the relational model following this a document-based task, a non-relational database may be structure (Object Relational Mapping, ORM) [14] to be appropriate. This is because NoSQL data stores allow complicated and inefficient. Several improvements within stored data to remain in a form that more closely Sánchez-de-Madariagaetal.BMCMedicalInformaticsandDecisionMaking (2017) 17:123 Page3of14 approximates its true representation [24]. And also be- 20,000 real standardized EHR extracts(provided by several cause of the special persistence policies of EHR docu- hospitals) have been stored, retrieved and queried on each ments(seeDiscussionbelow). DBMSinordertocalculatetheirresponsetimes(computa- In the specific case of standardized medical information, tionalcomplexityastestcollectionsizeduplicates).Concur- our problem is not the lack of a schema. In fact we have a rency experiments have also been conducted in order to complex and over specified schema. Recent research [26] comparetheperformanceofrelationalMySQLandNoSQL addressestheautomaticdiscoveryofaschemaindocument MongoDBDBMSfromthatperspective. storesinNoSQLapplicationsinordertosimplifydataman- Node + Path and ARM improvements to the relational agement and to combine both fixed-schema SQL and model are also considered and discussed. Theirperform- flexible-schemaNoSQLinasingledatamanagementsystem ance is illustrated by results previously published in the [27]. Conversely, the ARM relational approach discussed in literature, using similar databases and queries. In this Methods below attempts to directly manage standardized way we get a general perspective of the most important medicalinformationbysimplifyingitscomplexschema. DBMS methodologies which have been used to persist normalized medical information in current state-of-the- NoSQLcurrentstate-of-art artEHR systems. There are numerous whitepapers, blog entries and com- This research covers an investigation into the appropri- mentaries onthe advantagesofNoSQLdatabase systems ateness of relational and NoSQL database systems under [28]. However, there has been little research on evaluating different situations and perspectives. The emergence and theuseofNoSQLdatabasesinthehealthcaredomain[29], utilityofNoSQLdatabasesanditsrelationshiptorelational particularlywithrealisticstandardizedhealthcaredata. systems has not yetbeen sufficiently discussed, in the con- NoSQL databases might offer a solution to the big textofstandardizedmedicalinformationpersistence. amount ofmedical information bottleneck[30,31]. Thereareover150differentNoSQLdatabases,grouped Methods into the following four categories: (1) Key-value store, (2) In order to directly compare different EHR extracts data- Document store, (3) Column-family, and (4) Graph base persistence systems we have used examples of three database [32]. Open source availability of NoSQL data- ofthemostimportantdatabasesystemmethodologies,i.e. bases reduces the overall cost considerably [29]. Within relational (MySQL),document-based NoSQL(MongoDB) one application, different classes of NoSQL databases can andnativeXML(document-based)NoSQL(eXist). beusedsimultaneously,whichisknownaspolyglotpersist- TheseDBMSshavebeentestedagainst threedatabases ence.Ingeneral,eachclass ofNoSQLdatabaseisdesigned formed by 5000, 10,000 and 20,000 ISO/EN 13606 stan- foraspecificpurpose[24,23]. dardized EHR extracts containing alerts, problems and Restricting our description to document-store kind pharmacy information for a lower number of patients MongoDB,itprovideshighperformancedatapersistence (sowithacertain number ofextractsperpatient). to support embedded data models which reduce I/O ac- Theseextractshavebeenputtogetherusinginformation tivity, automatic horizontal scale-up, unstructured data from several Spanish hospitals (Fuenlabrada University model (which suits EHRs), high available distributed sys- Hospital, Barcelona Clinical Hospital and A Coruña tem, denormalized localized data reducing the need for UniversityHospitalComplex)andprimaryhealthcarecen- joins, 1/10 cost relative to relational SQL-based systems tres. These centres work with us in the PITES (Plataforma and better performance than these systems [33]. It also de Innovación en nuevos servicios de TElemedicina y improves big data analysis performance on EHR systems e-Salud) coordinated research project [36]. Information overSQL-basedsystems[34]. comes from different departments of hospitals and health Other NoSQL document-store databases such as carecentres.Itconsistsofheterogeneousinformationres- CouchDB, used on normalized data, perform better than idingininformationsystemsfromdifferentmanufacturers corresponding relational systems on non-normalized (i.e. and vendors. It has been normalized, homogenized and simpler) EHR data, and are promising where ACID prop- centralized using archetype-based data transformation ertiesarenotstrictlyrequired[35]. technologies such as the LinkEHR studio tool [37, 38]. It hasalsobeenproperlyanonymizedusingasolutiondevel- Objective opedbyourUnitinpreviouswork[39]. This research showcases several experiments which have Queries of increasing complexity were performed on been carried out in order to directly compare the imple- these databases,againstthe information contained in the mentation of the persistence layer of an EHR system problems list. Response times to these queries were using three different DBMS: one relational (MySQL) and calculated, in order to compare the performance and the two NoSQL (document-based MongoDB and native XML algorithmic complexity of the three DBMSs methodolo- eXist).Threeincreasingsizecollectionsof5000,10,000and gies(see below). Sánchez-de-Madariagaetal.BMCMedicalInformaticsandDecisionMaking (2017) 17:123 Page4of14 We also provide a short description of two separately to fit into relational structure [23]. This motivate us developed and related methodologies. Their figures will to study its performance as the persistence level in an be shown in the results section, in order to provide a EHR system. broad perspective and an insight into database persist- enceonstandardizedmedicalinformation. BuildingaNoSQLMongoDBdatabasesystemtostore andquerynormalizedEHRextracts BuildingarelationalMySQLdatabasesystemtostoreand As stated above new NoSQL DBMSs (such as MongoDB querynormalizedEHRextracts [40]) have recently attracted the attention of database The relational model for database persistence [13, 17] is system developers mainly in those document-centred a very well established and mature methodology, which persistence applications, where a standard relational ap- isparadigmatic.Itisbasedonwell-knownformalrelational proach maynotbeefficient.Asstated above this NoSQL algebra and calculus, and it has guaranteed the precision approach may provide faster access and more especially and consistency of RDBMSs (Relational DBMS) for a long in the case of very large databases. But it may be very time.Recently,alternativemethodologies(i.e.NoSQLdata- inefficient if document data contain references or links bases)haveattractedtheattentionofpracticaldatabasesys- to other similar pieces of documents in the database tem developers [20]. NoSQL approaches can be faster and [25], and these links affect database consistency. In gen- morescalablewhendatasizesareextremelylarge,orwhen eral, this NoSQL approach is adequate for a so-called therearenointernaldocumentreferencesthatcandamage document-basedapplication. speed or data consistency. However, often these supposi- ANoSQLdatabasesystemsuchas MongoDB canbuild tions will depend on each specific project. For this rea- adatabaseofEHRXMLdocumentsinaquitestraightfor- son, we have implemented a relational MySQL DBMS, ward fashion. Normalized XML documents are provided in order to achieve and evaluate the persistence of ISO/ as input to the system in JSON format, i.e. they must be EN13606standardizedEHRextracts. previously transformed from XML to JSON. JSON may WehaveusedJAXB(JavaXMLBinding)andJPA(Java follow the same structure as XML. In fact the MongoDB Persistence API) technologies in order to automatically system uses a proprietary slight variation of JSON called transform the XML schemas representing the ISO/EN BSON (for Binary JSON that enables binary serialization 13606 standard into a MySQL relational database. These on data), but there are programs which allow for easy XML schemas [6] indicate a standardized EHR extract conversion. Given a BSON version of the original XML system the possibly permitted XML documents this extracts document collection, this is provided as input to extract may adopt, i.e. its legal or valid instantiations. MongoDB,andaworkingdatabaseisconstructedinafast Consequently, JAXB can take these XML schemas as and straightforward manner. BSON/JSON documents are inputinordertoproduce(asoutput)aJavaclass’srepre- thenstoreddirectlyasBLOBsmaintainingtheirstructure, sentation of any such XML extract. These Java class files andtheymaybesubsequentlyqueriedorretrieved. may then be manually tagged with JPA codes, in order MongoDB has its own DBMS including a complete set to generate a MySQL relational database with the struc- of CRUD (create, read, update and delete) operations. ture of these classes, i.e. the structure of the original TheseoperationsarebasedonthetreestructureofJSON/ XML extract file. This database may be used to store, XML and rely on the tree path from root node to leaf query (using standard SQL) and retrieve EHR extracts. nodes, where data may be stored. They may be translated This process is often called Object Relational Mapping to and from standard SQL statements, performing virtu- (ORM) [14]andisdepictedschematicallyinFig.1. allyeveryimportantfeature. ORMmaysuffertheso-called“Object-RelationalImped- Since MongoDB is a document-centred database it ance Mismatch”. This happens when an object is molded produces JSON/XML documents as output, which is an Fig.1OnewaytoperformORMonstandardizedEHRextracts Sánchez-de-Madariagaetal.BMCMedicalInformaticsandDecisionMaking (2017) 17:123 Page5of14 important aspect as it is very appropriate for document- a hybrid serialization Node + Path approach (see sche- based persistence tasks. Consequently if a CRUD oper- maticrepresentationin[15]). ation is launched from a Java application, in order to Thisapproach hastheadvantageofatabularrelational query the extracts database, and if a table-like output is structure with less tables and join operations, and the desired, the JSON output documents must be parsed direct query of fine-grained data using the paths ex- and processed (as a subsequent part of the query), in tracted from archetypes. However, issues still remain ordertoproducesuch relationaltable-like results. about the uniqueness of data (archetype-based paths), fastparsing andcomparison ofpaths, andtheprocessing BuildinganativeXMLNoSQLeXistdatabasesystemto ofcomplex queries. storeandquerynormalizedEHRextracts EHR extracts are codified in XML format. Consequently, Thearchetyperelationalmapping(ARM)persistence anativeXMLDBMSsuchaseXist[41]shouldbeevaluated solution toimplementthepersistencelayerofanEHRsystem.eXist A different approach is adopted in the so-called ARM is an open source management system entirely built on process in order to optimize the ORM relational system XMLtechnology,alsoknownasanativeXMLdatabase.A [16]. We will refer to it in this section since we will be native XML database also provides a mechanism for stor- presentingsomeresultslater. ageandretrievalofdata,differentfromthetabularrelations In the ARM process, a new relational database used in relational databases. Consequently, it may be con- schema, different from the direct relational schema used sideredasaNoSQLdatabase.Beingasemi-structureddata- in the ORM, is generated. This means that information base, it stores data in the form of entire XML documents, elements of an EHR extract as constrained and repre- so it may also be considered a document-based NoSQL sented in the archetypes, are mapped into tables, key, database [19]. Considerations regarding the existence of foreign key and common columns. The structure of the links in EHR extracts stored in MongoDB in the previous information as stated in the archetypes is used to define sectionmayalsobeappliedtotheeXistDBMS,considered thenewrelationalschema,usingthesemappings. asakindofdocument-basedapplication. Using this methodology, archetypes are mapped into TheEHRextractsareloadedbytheeXistDBMSdirectly tables, and archetype basic data types are mapped as com- as XML files, maintaining their structure, and may be mon columns. If their occurrence in the archetype is 1, or subsequently queried and retrieved, using the XQuery lan- into standalone tables with two columns, if their occur- guage, a W3C recommendation. The XQuery language is renceishigherthan1:oneisaforeignkeycolumnreferring able to produce XML-format files output. This means that identification and the other is a common column mapped relational table-like output may be easily generated in the form the data item. Query data items are constrained as formofXML-formattedfiles,contrarytoMongoDB,which indexedcolumns,inordertoimproveperformance. produces entire JSON output that must be post-processed Consequently the simplified structure of the several toyieldrelational-likeoutput(seeprevioussection). archetypes participating in a specific EHR extract is used to build the relational schema used by that extract, in- TheopenEHRnode+pathEHRextractsdatabasesystem stead of using the general structure of the whole RM of We provide a short introduction to openEHR’s Node + thedualmodel,asdoes theORMapproach. Path even though we have not used it, and later we will Asaresult,thenewsimplifiedrelationalschemashould showsomeresults.openEHRhasdevelopedanoptimization be much more efficient than the straightforward (and over the ORM relational methodology based on the EAV complicated) ORM schema. This might have evident per- (entity-attribute-value)model[42].TheNode+Pathpersist- formanceimprovements(seetheResultssection). ence[15]isbasedontheserialisationofinformationobjects However, since some structural information from the (entire extract trees) into single blobs, requiring only one original extract is lost during the process of building the column in a relational database table. Additional indexing simplified relational schema, it is not possible to recover columns are added for attribute values in order to provide the original extract as it was before its storage. Thus we some query ability. This basic approach may be im- can query the medical informationpresent in the extracts proved in two different ways: one (hybrid serialisation) butwecannotrecoverthemintheiroriginal state.Thisis serialising only lower level elements of the object trees, the main reason why ARM has not been implemented in while storing transparently upper level objects (requiring thisresearch. some object-relational mapping, i.e. new tables) and an- other one (Node + Path approach) recording the path (as Otherrelationaldatabaseimprovements in the archetype) of each blob in a two-column table of Other relational DBMS improvements include column- <node path, serialised node value > with an index on the basedsystems[43,44]suchasforinstanceMonetDB[45] pathcolumn.Thetwoimprovementsmaybecombinedin and VectorWise technology [46]. Column-based database Sánchez-de-Madariagaetal.BMCMedicalInformaticsandDecisionMaking (2017) 17:123 Page6of14 systems are based on the fact that conventional row- based on the server’s system clock. In the MongoDB stored systems might need to read in unnecessary data, database a Java application was constructed to query a when performing reads. On the other hand, column- MongoDB 2.6 database on Windows. The Log4j logging stored systems only need to read in relevant data, even tool was used to set transactions at the beginning and at though writes require multiple accesses. Consequently, the end of each query to the server and total times were these systems constitute an optimization for read-intensive computed based on the timestamps of these transac- large data repositories. This improvement will be reasoned tions. The eXist 3.0RC1 DBMS was queried using its laterinthediscussion. Java client, which yields the execution time required by each query. All values were calculated as the average re- QueriesappliedtotherelationalandNoSQLDBMSs sponsetimesoffivequery executions. Table 1 shows six different queries that have been ap- plied to the three size increasing databases of the three Concurrencyexperiments DBMSs (one relational and two NoSQL) in our experi- There are many indicators to assess performance of a ments. Queries were performed on the extracts against DBMS.Oneofthemisthebehaviourofqueryingthedata- the problems information of each patient. Each patient’s base concurrently. In order to evaluate the ORM/MySQL problem hasanumberofattributessuchasname,initial and the NoSQL/MongoDB DBMSs (the NoSQL/eXist date,resolutiondateor severity. DBMS has not been included in these experiments, given Queries range in complexity from one patient to all the isolated queries performance results presented below) patients, from one problem to all problems for each under a concurrency environment we have designed the patient, and from the raw list of attributes to conditions following experiments, inspired by the XMach-1 bench- imposed on those attributes, such as dates being later markforXMLdatamanagement[47]. thanagivendateorseveritybeing‘high’.Ingeneralterms, A Java multithread application was constructed, with with the exception of Q6, the six queries are ordered by three main threads representing three of the presented increasingcomplexity. queries competing for CPU (Central Processing Unit) The queries were originally designed in the context of use. The queries selected were Q1, Q3 and Q4, for two a Machine Learning application aimed at obtaining asso- main reasons: first, these are medical practice queries, ciation rules held by the problems of the patients, i.e. a whicharemorelikelytobeposedinaconcurrentfashion research application, but some were later adapted to i.e. secondary research use queries will be performed in clinicalpractice. isolation; and second, the XMach-1 benchmark recom- Queries Q1, Q3 and Q4 refer to problems and charac- mends the use of queries with short response times (90% teristics of problems of a single patient and are thus to under 3 s) since it is rather easy to increase throughput be typically formulated in a scenario of clinical practice without response time limits. For this last reason, these i.e.primaryuseofmedicalinformation. experiments were only performed in the small 5000 EHR OntheotherhandqueriesQ2,Q5andmaybeQ6relate extractsdatabase. to problems and characteristics of problems of all the Three different priority levels were assigned to each patients of the whole database and are more appropriate query, namely high, medium and low to Q1, Q3 and Q4 forsecondaryuseofinformation(i.e.medicalresearch). respectively, yielding a CPU use distribution of approxi- Response times to queries were calculated in the rela- mately45%,35%and20%.Increasingtimeinmilliseconds tional database using MySQL 5.0.67 on Linux/SUSE. wait and notify instructions were added to decreasing This environment yields the time in seconds of queries prioritythreadsinordertostabilizethisdistribution. These experiments were executed five times during Table1SixqueriesperformedonrelationalandNoSQL 10mineach.Thereafterthemostexecuted(highestpriority) databases queryaveragethroughputandtheaverageresponsetimesof Query thethreequerieswerecalculated. Q1 Findallproblemsofasinglepatient Q2 Findallproblemsofallpatients Indexingpolicies Q3 Findinitialdate,resolutiondateandseverityofasingleproblem Many DBMSs build structure and range indexes auto- ofasinglepatient matically. For instance, in our experiments, the MySQL Q4 Findinitialdate,resolutiondateandseverityofallproblemsof system built 602 such indexes in that way. We only built asinglepatient manually those indexes that would speed up execution Q5 Findinitialdate,resolutiondateandseverityofallproblemsof of some queries. For instance, the field representing the allpatients unique identifier of a patient is a very important attribute Q6 Findallpatientswithproblem‘pharyngitis’,initialdate>= tobeindexedinqueriesregardingonesinglepatient,such ‘16/10/2007’,resolutiondate<=‘06/05/2008’andseverity‘high’ as Q1, Q3 and Q4. Other indexes constructed manually Sánchez-de-Madariagaetal.BMCMedicalInformaticsandDecisionMaking (2017) 17:123 Page7of14 representattributesdemandedinourqueries,suchasspe- Table3Responsetimesinsecondsofthesixqueriesperformed cific problems of patients, initial and resolution dates or onMongoDBNoSQLdatabase severityofthoseproblems(seeTable1). MongoDB 5000 10,000 20,000 slope(*10−6) In the MongoDB and eXist systems, we also tried to Q1 0.0460 0.0570 0.1221 5.07 construct the indexes manually. However, MongoDB Q2 34.5181 68.6945 136.2329 6780.99 emitted an error message stating that those indexes had Q3 0.0480 0.0580 0.1201 4.81 already been constructed by the system. The eXist data- Q4 0.0520 0.0610 0.1241 4.81 base did not return such an error message, but adding the indexes has not changed the response time of the Q5 38.0202 75.4376 149.9330 7460.85 queries, so we assume that those indexes were also built Q6 9.5153 18.5566 36.7805 1817.68 automatically bythesystem. Databasesize 1.95GB 3.95GB 7.95GB Totalextracts 5000 10,000 20,000 Results *standsforthemultplicationsign Table 1 shows the six different queries performed on the relationalMySQL andonthe two NoSQLDBMSsdevel- relational being much slower than the non-relational. oped intheprevious section. eXist presents slower response times than MySQL and Tables 2, 3 and 4 show response time in seconds of MongoDB in the small 5000 extracts database, but inter- the six queries in the three DBMSs with three different mediate slopes, thus beating MySQL in the big 20,000 database sizes, i.e. 5000, 10,000 and 20,000 normalized extracts database, with the exception of Q6, in which EHRextracts. ORM/MySQL behavesbetter. At first glance one can see a strong linear increment Table 5 complements the response times of the three in response times of the three DBMSs as the size of the DBMSs showing the time and space costs of storing and database grows. This linear behaviour may be better retrieving EHR extract XML files on them. These storage appreciated in the nine diagrams of Figs. 2 through 5 times include the indexes being constructed or updated. (notice the different vertical axes scales used throughout The contrast between the fast time costs to store and most of these figures). Figure 2 shows queries Q1 and retrieve documents in the NoSQL databases and the low Q4 almost linear complexity in ORM (up), MongoDB response times in the relational database, and the compar- (down left) and linear complexity in eXist (down right). able memory space costs of the NoSQL systems and the Figure 3 presents linear complexity in MongoDB and relational database is evident. This table also shows the eXist for queries Q2 and Q5 and unbounded response average size in memory of one extract in the database and time in ORM for these queries. Figure 4 shows almost theaveragesizeofoneextractinXMLformat. linear complexity for Q3 in MongoDB and linear com- plexity in ORM and eXist. Finally Fig. 5 displays linear Resultsbyotherimprovedrelationalsystems complexity for Q6 in both ORM and MongoDB, and al- We provide in Table 6, for illustrative purposes, the mostlinear complexity ineXist. results obtained by the two improved relational systems However, the three DBMSs show very dissimilar slopes (ARM and Node + Path) described in Methods above, as intheirlinear behaviour(seeslope column in Tables 2,3 they appear in [16]. Our Table 6 shows the most similar and 4). Whilst MySQL and MongoDB yield very similar queries from Table 5 in that work corresponding to three results in the small 5000 extracts database they diverge ofour queriesfromour Table 1,withtheir corresponding considerably in the big 20,000 extracts database, the responsetimes. Table2Responsetimesinsecondsofthesixqueries Table4Responsetimesinsecondsofthesixqueriesperformed performedonMySQLrelationalORMdatabase oneXistNoSQLdatabase ORM 5000 10,000 20,000 slope(*10−6) eXist 5000 10,000 20,000 slope(*10−6) Q1 0.0429 209.0284 1082.7872 72,182.95 Q1 0.6608 3.7834 7.3022 442.76 Q2 101.6196 >10,000 >10,000 >> Q2 60.7761 129.3645 287.362 15,105.73 Q3 0.1256 4.1982 12.6110 832.36 Q3 0.6976 1.7710 4.1172 227.96 Q4 0.1843 400.6388 1598.7410 106,570.45 Q4 0.6445 3.7604 7.3216 445.17 Q5 200.8954 >10,000 >10,000 >> Q5 145.3373 291.2502 597.7216 30,158.93 Q6 0.7362 65.1898 185.2420 12,300.39 Q6 68.3798 138.9987 475.2663 27,125.82 Databasesize 4.8GB 9.7GB 19.8GB Databasesize 1.25GB 2.54GB 5.12GB Totalextracts 5000 10,000 20,000 Totalextracts 5000 10,000 20,000 *standsforthemultiplicationsign *standsforthemultiplicationsign Sánchez-de-Madariagaetal.BMCMedicalInformaticsandDecisionMaking (2017) 17:123 Page8of14 Fig.2ORM(up)andNoSQL(MongoDBleft,eXistright)responsetimestoqueriesQ1andQ4forthreedatabasesizes Fig.3ORM(up)andNoSQL(MongoDBleft,eXistright)responsetimestoqueriesQ2andQ5forthreedatabasesizes Sánchez-de-Madariagaetal.BMCMedicalInformaticsandDecisionMaking (2017) 17:123 Page9of14 Fig.4ORM(up)andNoSQL(MongoDBleft,eXistright)responsetimestoqueryQ3forthreedatabasesizes Direct comparison with our results is not possible since However, if we compare optimized relational system databasesizesaredifferentasisthetotalnumberofextracts. ARM (Table 6) and non-relational MongoDB system Thismeansthat,forinstance,sincetwocomparablenormal- (Table 3) we can see that the latter beats the former in ized relational systems (optimized ARM and ORM, 5000 both Q1 (Query 2.1) and Q3 (Query 3.1): interpolating extractsdatabase)havequitesimilardatabasesizes(2.90GB MongoDB Q1 and Q3 from the 10,000 and 20,000 ex- and4.8GBrespectively),buttheformerholdsamuchlarger tracts results to a hypothetic 30,000 extracts database numberofextracts(29,743),thesizeoftheextractsusedin (similar to the 29,743 extracts ARM database) response ourdevelopments(244.876KBinXMLformat,seeTable5) times would be 0.1872 and 0.1822 respectively (against should be much larger than that of the normalized EHR 0.1910 and 0.2700 for ARM), even though ARM data- extractsusedintheexperimentsreportedin[16]. base size is quite a bit smaller than the 20,000 extracts Fig.5ORMandNoSQL(eXistandMongoDB)responsetimestoqueryQ6forthreedatabasesizes Sánchez-de-Madariagaetal.BMCMedicalInformaticsandDecisionMaking (2017) 17:123 Page10of14 Table5ShowsspaceandtimecoststostoreandretrieveXML Table7Showsmostfrequentquerythroughputandresponse documentsinthethreeDBMSs timesinconcurrentexecutioninMySQL Retrieval Storage Sizein XMLFile ORM Throughput Responsetime time(ms) time(ms) memory(KB) size(KB) Q1 4711.6 0.0793 ORM/MySQL 6188.5 7569.7 960 244.876 Q3 4711.6 0.1558 NoSQL/MongoDB 14.0 35.0 390 244.876 Q4 4711.6 0.9674 NoSQL/eXist 4.9 90.9 250 244.876 likely need improvements. On the other hand the much MongoDB database, but also than the 20,000 extracts flatter linearity of the NoSQL systems merits further relationalORMdatabase,i.e.thesizeoftheARMextracts research, in order to decide the appropriateness of should be smaller. However, comparing Q6 with query document-based database approaches. 7.1,optimizedARMperformsbetterthantheinterpolated DirectcomparisonofTables3and4showthatMongoDB MongoDBdatabase:interpolatingMongoDBQ6to30,000 performs considerably better than eXist in the six queries, extracts would yield response time 55.0044. Notice from yielding a linear but considerably flatter slope in all Tables 2 and 3 that Q6 is also the query where non- cases, and suggesting that document-based NoSQL da- optimized relational ORM scores the best slope relation tabases such as MongoDB are a better solution than withrespecttoNoSQL. native XMLNoSQLDBMSssuchaseXistinordertoper- sistandqueryISO/EN13606standardizedEHRextracts. Resultsoftheconcurrencyexperiments FromTables7and8wecanseethataNoSQLMongoDB Tables 7 and 8 show the average throughput of Q1 (the databaseyieldsmuchhigherthroughputthanORMMySQL most frequent query) and the average response times of and also query execution time is also much faster in the Q1, Q3 and Q4 yielded by the concurrency experiments formerthaninthelatter,forthesixtypesofqueries. describedinSection2.8.Q1achievesmuchhigherthrough- Infact,MongoDBqueriesrunfasterinconcurrencythan putintheMongoDBsettingthanintherelationaldatabase. in isolation. This is because MongoDB query execution It should be noted that all three queries yield much faster contains a considerable amount of time consuming responsetimesinMongoDBthaninMySQL.Itseemsthat administrative and communication tasks that are opti- concurrent execution favours MongoDB, since these quer- mized in an execute-once fashion in the concurrent iesexecutefasterconcurrentlythaninisolation. version, concentrating CPU execution time in the query itself. From this point of view, MongoDB standsas avery Discussion efficient,optimizableandeffectivedatabasesystem. Directcomparisonofresults We observe from the results shown in Tables 2, 3 and 4 Table-likevs.document-likeresults that therelational andNoSQLdatabase systemsuse very Results in relational databases are always presented in a different storage and access philosophies. The very high table-like form, i.e. an SQL-like query always returns a number of tables generated in the pure relational ORM set of values in the form of a relational table or similar. approach induces many expensive join operations, result- A whole document may also be reconstructed, but this inginahighercomputational costas thesizeofthedata- is a fairly slow task, at least in ORM. On the other hand, base grows and showing a much higher linear slope. In a query in a document-based DBMS such as MongoDB contrast, NoSQL time costs also seem to grow linearly might return a whole document (or a modified or simpli- withdatabasesize,eventhoughwithamuchflatterslope. fieddocument)asresult(usuallyXML/JSONdocuments); With the results obtained in Tables 2, 3 and 4 pure rela- butthisdocumentmayalsobefurtherprocessed,inorder tional ORM does not seem practical since response times toproducearelationaltable-likeresult. grow (almost) linearly but at a prohibitive slope, and will However, when medical practitioners make primary use of medical information (clinical practice), they tend Table6Showsillustrativedatafromthreerelationaldatabase to visualize normalized medical information regarding a managementsystemspresentedin[16] ARMpaper IV(sec) ARM(sec) Node+Path(sec) Table8Showsmostfrequentquerythroughputandresponse Q1 Query2.1 0.221 0.191 24.866 timesinconcurrentexecutioninMongoDB Q3 Query3.1 0.242 0.270 294.774 MongoDB Throughput Responsetime Q6 Query7.1 14.582 1.293 41.217 Q1 178,672.6 0.0030 Databasesize 1.60GB 2.90GB 43.87GB Q3 178,672.6 0.0026 Totalextracts 29,743 29,743 29,743 Q4 178,672.6 0.0034
Description: