ebook img

Entity Attribute Value Style Modeling Approach for Archetype Based Data PDF

30 Pages·2017·3.95 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Entity Attribute Value Style Modeling Approach for Archetype Based Data

informatio n Article Entity Attribute Value Style Modeling Approach for Archetype Based Data ShivaniBatra1,ShellySachdeva1,*andSubhashBhalla2 ID 1 DepartmentofComputerScienceandEngineering,JaypeeInstituteofInformationTechnologyUniversity, Noida201304,India;[email protected] 2 DivisionofInformationSystems,UniversityofAizu,Fukushima965-8580,Japan;[email protected] * Correspondence:[email protected];Tel.:+91-120-2400973 Received:31October2017;Accepted:20December2017;Published:29December2017 Abstract: EntityAttributeValue(EAV)storagemodelisextensivelyusedtomanagehealthcaredata inexistingsystems,howeveritlackssearchefficiency. Thisstudyexaminesanentityattributevalue stylemodelingapproachforstandardizedElectronicHealthRecords(EHRs)database. Itsustains qualities of EAV (i.e., handling sparseness and frequent schema evolution) and provides better performanceforqueriesincomparisontoEAV.ItistermedastheTwoDimensionalEntityAttribute Value(2DEAV)model. Supportforad-hocqueriesisprovidedthroughauserinterfaceforbetter user-interaction. 2DEAVfocusesonhowtohandletemplate-centricqueriesaswellasotherhealth queryscenarios. 2DEAVisanalyzed(intermsofminimumnon-nulldensity)tomakeajudgment about the adoption of 2D EAV over n-ary storage model of RDBMS. The primary aim of current researchistohandlesparseness,frequentschemaevolution,andefficientquerysupportaltogetherfor standardizedEHRs. 2DEAVwillbenefitdataadministratorstohandlestandardizedheterogeneous datathatdemandshighsearchefficiency. Itwillalsobenefitbothskilledandsemi-skilleddatabase users(suchas,doctors,nurses,andpatients)byprovidingaglobalsemanticinteroperablemechanism ofdataretrieval. Keywords: entity attribute value model; Electronic Health Records (EHRs); standardization; archetypebasedEHRs 1. Introduction Efficientdatamanagementandfasteraccessproceduresareessentialforhealthcareapplication. Healthcaredatamanagementposesvariouschallengesintermsofsparseness(highpercentageof nullvalues),frequentevolution(changesinschema,andthus,changesincorrespondinghealthcare application)andquickdataaccess. • Sparseness: Amongthevastdimensions(attributes)ofhealthcaredomain, onlyfewareactive forapatient[1]. Forexample,apatientwithfevermightnotundergoanybloodtest,andthus, thecorrespondingattributeswillcontainnull(sparse)values. • FrequentEvolution:Withtime,medicalknowledgeevolves.Thisresultsinnewdiagnosisparameters for providing more accurate decisions. For example, a few years back, four-dimensional (4D) ultrasoundtechnologyhadbeenintroducedthatassistedinabetterunderstandingofthefetus. Thisrequireschangesinexistingdatabaseschema(andthus,changesincorrespondinghealthcare application)foraccommodatingthenewknowledgeintermsofattributes/parameterstoberecorded andpresentedtotheuserondemand. • QuickDataAccess: Dataextractioncanbeforaspecificpatientorforapopulation. Whenpatient data or population is extracted, target data can be characterized as rows and columns of a relationalmodel,respectively. Extractingpatientdatainstantlyishighlydemandedinhealthcare Information 2018,9,2;doi:10.3390/info9010002 www.mdpi.com/journal/information Information 2018,9,2 2of30 domain as it can lead to life loss. Whereas, population queries need not be answered in real time[1]. But,irrespectiveofthetypeofquery,fasterdataaccessisalwaysappreciable. RDBMSfollowthen-arystoragemodel(NSM)approach,inwhichatableconsistsof‘n’columns (one per attribute) [2]. A typical normalized relational database follows good ER design policy. A sparse table can be divided into multiple non-sparse normalized tables with a good ER design. However, a frequent evolution of schema in a good ER scenario is very expensive and seeks the involvementofaschemadesignerforchangesintheschema.Changesinschemacanbetheadditionof anattribute,deletionofanattributeordatatypemodificationcorrespondingtoanattribute. Addition of an attribute is the most expensive operation that might result in the creation of new tables and correspondingrelationships. Theexistingdataisthenmovedtothenewlydefinedtables. Inaddition, everytimethattheschemaismodified,theinformationsystembuiltontheunderlyingschemaneeds tobemodifiedaccordinglyandretestedforflawlessoperation. In relational database literature, various data storage models (such as, entity attribute value model [3], decomposed storage model [4], wide table [5], and interpreted storage [6]) have been proposedforstoringsparsedatasets. Healthcaredomainpersisteddataintovariousdatabases(suchas XML,Node+Path,archetyperelationalmapping,anddynamictables)basedonthesedatastorage modelsaswell[1,7–10]. TheEntityAttributeValue(EAV)modelisobservedtobethemostwidely adopted storage model in clinical systems [11]. The EAV [1] model has a fixed schema structure consisting of three columns, referred to as Entity, Attribute, and Value. The ‘Entity’ column will store the contents of the primary key, the ‘Attribute’ column will store the name of the attribute, andthe‘Value’columnwillstorethedatavalue. Foreachnon-nullentry(exceptprimarykeyentries) in the relational table, one row in the EAV table is constructed. Thus, EAV stores only non-null values. Also,EAVmodelenhancesflexibilitybyallowinganynumberofattributestobeaddedbyjust specifyingitsnameinthe‘Attribute’column. Thisenablesnochangesinschemaandtheunderlying informationsystem. Existing research [1,11–13] in healthcare/biomedical, as well as other domains, such as e-commerce [3] and semantic web [14], strongly favors EAV. EAV was first employed in the TMR (TheMedicalRecord)system[15]andtheHELPClinicalDataRepository[16–18]. Thepresenceof a single ‘value’ column in EAV hinders the ability to use multiple data types. Thus, to deal with heterogeneity, the open-source TrialDB clinical study data management system [19,20] explored the use of multiple EAV tables (one table for each data type). Further, EAV has been extended to incorporate relationships among various medical concepts through the EAV/CR framework [21]. Yale’s SenseLab [22,23] used the EAV/CR framework to build a publicly accessible neuroscience database. EAV is also utilized by Oracle’s health sciences division in its commercial systems ClinTrial [24] and Oracle Clinical [25], for modeling clinical data attributes. Nowadays, many commercial applications utilize various aspects of EAV internally, including Oracle Designer [26] (forERmodeling),andKalido[27](fordatawarehousingandmasterdatamanagement). A good ER design may result in thousands of tables. For example, in the clinical domain, hundredstothousandsofrelationaltables(onecorrespondingtooneform)needtobegenerated[11]. Intermountain Healthcare’s enterprise data warehouse involves 9000 tables and 10 terabytes of data[11,28]. Rowsintables(correspondingtoagoodERdesign)mayvaryfromfewtothousandsor millionsinnumber. However,visually,tableswithhugenumberofrowsareemphasizedequallyto thetableswithfewrows. ThetableswithfewrowsaresuitablecandidatesforEAVrepresentation. Forexample,inontologymodellingenvironment,categories(classes)mustoftenbecreatedonthefly, andsomeclassesareofteneliminatedinsubsequentcyclesofprototyping[29]. Thissituationisbest candidateforEAVthatcanaccommodatechangesinclasseswithoutschemachange. Primarily,currentresearchisfocusedonfasterdataretrievalandcomplexad-hocquerysupport inadditiontothecharacteristics(handlingsparsenessandfrequentevolution)ofEAVmodel. Information 2018,9,2 3of30 • Fasterdataretrieval: BesidestheextensiveadaptabilityofEAVinthehealthcaredomain,EAVlacks searchefficiency. Fordataextraction,anexhaustivescanofEAVtablesisrequired,whichadds adelayinprocessingtheoutput. Thus,thisresearchpaperproposesanentityattributevalue stylemodelingapproachforstandardizedElectronicHealthRecord(EHR)databases. Thenew storagemodelproposedistermedasTwoDimensionalEntityAttributeValueModel(2DEAV) thatextendstheEAVtoprovidefasteraccessibilityforpatient-specificandpopulationqueriesin comparisontoEAV.2DEAVusesamixtureoftheEAVmodelandtheNSMofRDBMStoproduce agenericstoragesystemthatstoresonlynon-sparsedataandcanaccommodatenewknowledge withabetteraccessspeed. • Complexad-hocquerysupport:EAVdatabaseiscomplementedwithmetadatatostoreallschema relateddetails.EAVmodelrepresentsthephysicalstructure(i.e.,howdataisactuallystoredondisk) andthemetadataassociatedwithitdepictsthelogicalstructure(i.e.,howdataisvisibletoendusers). However,inthehealthcaredomain,endusers(suchas,doctors,nurses,patients,andpharmacists) neednotrequireanyknowledgeaboutthephysicaland/orlogicalstructureofdata. Moreover, everyuserisnotawareofSQL.EvenifauserisawareofSQL,thequerycorrespondingtoEAVwill behighlycomplexanderror-prone[1].Thus,aGraphicalUserInterface(GUI)mustbeprovided tothemedicaldomainuserforaccessingElectronicHealthRecords(EHRs)thatarestoredinthe databaseofahealthcareapplication. CurrentresearchprovidesaGUIcorrespondingto2DEAV storagesystem.TheGUIgeneratestheSQLquerycorrespondingtoad-hocqueriesonfly,suchas querythatisconstructedbydesktopresidentquerytools,toextractthedesiredinformationwithout anypriorknowledgeabouttheunderlyingschema(2DEAVinourcase)providedtotheproposed GUIasaninput. Healthcare domain demands for frequent unambiguous exchange of data for better medical assistance to patients. Semantic interoperability and standardized data representation are crucial tasks in the management of modern clinical trials [30]. Many standards (such as openEHR [31], ISO13606 [32–34], and HL7 [35]) are making guidelines for standardized EHRs. The dual model approach[36],hasbeenintroducedintheSynapsesproject[37]andadoptedbytheopenEHRstandard. TheopenEHRprojecthasdevelopedclinicalmodel-drivenarchitectureforfuture-proofinteroperable EHRssystems[38]andcanbeharmonizedwithotherstandards[39]. Inthisstudy,weadoptadual modelapproachforstandardizationofEHRs. Itdividestheframeworkofdefiningmedicalconcepts electronicallyintotwolevelsthatsegregateknowledgefrominformation,asshowninFigure1. The reason behind adopting a dual model approach is the need for flexibility in adding new medical concepts without modifying the existing system. Level 1 in the dual model approach is referredtoastheReferenceModel(RM)[33,36],whichdefinesthebasicbuildingblocks(suchasdata typesanddatastructures)ofvariousmedicalconcepts. Level2,referredtoastheArchetypemodel (AM)[36],makesuseoftheinformationdefinedinRMtoproducecompleteknowledgeregarding amedicalconcept(intheformofonlineavailabledeliverablesknownasanarchetypeinopenEHR paradigm). AMappliesconstraintsontheinformationthatisprovidedinRMtobuildarchetypes. Medicalconceptsarepresentedintheformofarchetypes. Anarchetypeconstitutesallknowledge regardingonemedicalconcept,suchasthevariousattributesthatareincludedinagivenmedical concept, their data types, their range, and any other constraints [34,36]. For example, the Body Weightarchetypeconstitutestwoattributes,termedas‘Weight’and‘Comment’,whosedatatypes are ‘Quantity’ (quantifiable) and ‘Text’ (textual), respectively. An archetype may logically include other archetypes, and/or a specialization of another archetype. Thus, they are flexible and vary inform. Intermsofscope,theyaregeneral-purpose,reusableandcomposable[40]. TheopenEHR archetypesarefreelyavailablefordownloadfromastandardonlinelibrarysuchasClinicalKnowledge Manager(CKM)[41]. ArchetypesalsoprovidelinkstostandardterminologiessuchasSNOMED-CT (Systematized Nomenclature of Medicine Clinical Terms) [42] to avoid any ambiguity of terms. Anarchetypeisauthoredafterarigorousreviewprocessthatinvolvesateamconstitutingclinical experts and information technology experts [41]. For each attribute in an archetype, a maximum Information 2018,9,2 4of30 possibleoccurrenceisdefined. Thus,anattributecanbeoptionalormandatory. Differenthealthcare organizations can tailor their needs by inheriting the information stored in AM (in the form of archetypes) through the use of templates [36]. For example, a gynecology department will have a template designed using archetypes that are related to pregnancy, menstruation, and other women-specificdiseases. ThisstudyhasmadeuseofvarioustemplatestocollectstandardizedEHRs dataforexperimentation. Thecurrentresearchaimstoreducethedelayinaccessingthestandardized EHIRnfsordmaattioan. 2017, 9, 2 4 of 30 Figure 1. Dual model approach. Figure1.Dualmodelapproach. Medical concepts are presented in the form of archetypes. An archetype constitutes all 1.1. RelatedStudeis knowledge regarding one medical concept, such as the various attributes that are included in a given meSdpiacrasle cnoenscseipnt,m thedeiirc adladtao mtyapiensi, sthatetirri brauntegde,t oanwda radnsyt hoethdeirf fceornesnttrapirnotcse [d3u4,r3e6s].a nFodrp erxoacmespslees, btheein g Body Weight archetype constitutes two attributes, termed as ‘Weight’ and ‘Comment’, whose data followedbydistinguishedhealthcareproviders. Forexample,abloodpressure(archetype)constitutes types are ‘Quantity’ (quantifiable) and ‘Text’ (textual), respectively. An archetype may logically of five attributes- systolic, diastolic, mean arterial, pulse pressure, and comment. A doctor in one include other archetypes, and/or a specialization of another archetype. Thus, they are flexible and hospitalmayrecordbloodpressureintermsofsystolicanddiastolicpressure,whereasanotherdoctor vary in form. In terms of scope, they are general-purpose, reusable and composable [40]. The mayrecordbloodpressureintermsofmeanarterialpressure. Moreover,therearemanysituations openEHR archetypes are freely available for download from a standard online library such as where all the parameters are not recorded for a particular medical scenario. Thus, sparseness is Clinical Knowledge Manager (CKM) [41]. Archetypes also provide links to standard terminologies introduced. Manyapproacheshavebeensuggestedtohandledataefficientlyasfollows: such as SNOMED-CT (Systematized Nomenclature of Medicine Clinical Terms) [42] to avoid any ambiguity of terms. An archetype is authored after a rigorous review process that involves a team 1. NoSQL systems have been introduced to overcome limitations of Relational Database constituting clinical experts and information technology experts [41]. For each attribute in an ManagementSystems(RDBMS)(SeethenextfollowingSection1.1.1). archetype, a maximum possible occurrence is defined. Thus, an attribute can be optional or 2. Inthehealthcaredomain,dualmodelapproachopensanewpathforhandlingdatatomake mandatory. Different healthcare organizations can tailor their needs by inheriting the information a stable system that can capture future knowledge without making changes in the existing stored in AM (in the form of archetypes) through the use of templates [36]. For example, a application(SeethenextfollowingSection1.1.2). gynecology department will have a template designed using archetypes that are related to 3. preVganraionucys,s mtoernagsteruaaptpiorno,a achneds ootvheerr ewxoismtienng-sRpDecBifMicS daisreeassuesg.g Tehsties dsttuodmy ahkaes tmheadsey sutesem ocfo vmarpioatuisb le temwpitlhatefus tutor ecoevlloeclut tsiotanndanardd/izoerdt oEaHvRosid dsaptaar sfoern eesxsp(esreiemtehnetanteioxnt.f oTlhloew ciunrgreSnetc trieosnea1r.c1h.3 )a.ims to reduce the delay in accessing the standardized EHRs data. 1.1.1. WhyNotNosql? 1.1. Related Studeis NoSQLdatabasesaregainingpopularityasastorageoptionforhighlysparseandfrequently evolvinSgpdarastean.esNs oinS QmLedsiycsatle mdosmaarien aisim aettdribauttepdro tvoiwdianrdgss tchhee mdiaff-eleresnsts purpopcoedrturfoesr adnadta psrtoocreasgsees to attabieninflge xfoibllioliwtye.dH boyw deivsteirn,gtuoiashtteadin hfleaelxtihbcialirtey parboavniddeorns.i nFgors cehxeammapleen, tair ebllyooids nportesasugroeo d(arocphteiotynp[e4) 3]. Nacnodnksatirtnuitessta otefd fitvhea taNttroiSbQutLesd- astyasbtaosleic,( sduicahstaosliCc, amsseaannd raar)tecraimal,e ptoulssaem perecsosnucrleu, siaonnd acnodmimnterondt. uAce d CQdLofcotorrs icnh oenmea hdoespfiintaitli omnaya nredcdoradta bmlooadn ippruelsastuioren ifno treprmrosd ouf cstyivstiotylico afndde vdeialosptoelrics .pressure, whereas another doctor may record blood pressure in terms of mean arterial pressure. Moreover, there are many situations where all the parameters are not recorded for a particular medical scenario. Thus, sparseness is introduced. Many approaches have been suggested to handle data efficiently as follows: 1. NoSQL systems have been introduced to overcome limitations of Relational Database Management Systems (RDBMS) (See the next following Section 1.1.1). Information 2018,9,2 5of30 For data analysis, most of the NoSQL databases leverage Hadoop MapReduce functionality. ThistakesuserawayfromstandardSQL,whichrendersalargenumberofSQLbasedthirdparty analytical tools that are present in market (such as IBM Cognos, Tableau, SAP Business Objects, and Microstrategy) unusable [44]. Also, several systems (such as Hadapt, Hive, and Impala) that provideanSQLinterfacetothedataresidinginHadooprequireanexternalschemadefinitionfromthe usertoenabledataanalyticsviaSQL.Inaddition,someofNoSQLsystems(suchasApacheCassandra) haveintroducedtheirownSQL-likequerylanguages(suchasApacheCassandraintroducedCQL)to provideamoreuserfriendlyandeasierlearningenvironmenttodevelopers(thatareinpracticeof usingSQL).ProvidingSQL-likequerylanguageenablesschemadefinitionatlogicallevel. Thisschema supporthelpsinbuildingaconstrainedstoragesystemtoavoidincorrectdata. InordertoutilizethecapabilitiesofRDBMS,suchastransactionalsupport,storage-integrated access-control,read-writeconcurrencycontrol,statisticsgathering,andcost-basedqueryoptimization capabilities,authorschooseRDBMSforbuildingtheproposedstoragesystem. Inaddition,NewSQL system[45]alsofavorstomaintainACIDproperties,whichisinherentlyprovidedinRDBMS.However, NSM is expensive to evolve [3,5,7–10], and is restricted up to a certain limit to avoid disk page overflow[46]. ToattaintheflexibilityofferedbyNoSQL,wehaveadoptedanapproachsimilarto key-value approach (of NoSQL) i.e., EAV on RDBMS. In addition, a query builder is provided to make user free from the burden of writing complex queries, and thus, interact with the proposed systemflawlessly. 1.1.2. StorageofopenEHRStandardBasedData TherearethreeopenEHRbasedapproaches. ThesearecomparedinTable1. Table1.openEHRbasedpersistenceapproaches. Persistence ModelingLevel Advantage Limitation Approach Simpleprocessofcreating Deephierarchypresentin ObjectRelational tablesforclassesdefinedin ReferenceModel openEHRRMstructure Model(ORM)[7] RM.Also,stableinnature complicatestheORMscenario. since,RMisstable. XML,JSON, Complicatedpathsthatconsume BLOBisusedtodenotethe &Node+Path(using ReferenceModel morespaceaswellascausedelay hierarchyinformofpath. BLOB)[8] indataaccess. Schemaevolutionrequiresefforts Onerelationaltable ArchetypeRelational fromschemadesignertomodify Archetype iscreatedcorresponding Mapping(ARM)[9] schemaandthus,changesin toonearchetype. applicationcode. The first approach, Object Relational Mapping (ORM) provides a set of relational tables that cancapturedetailsofanyobjectdefinedinhealthcaredomain. However, theORMapproachgets complicatedaslevelsofhierarchyincreases[7]. SinceRMdescribesadeephierarchy,ORMisnotwell suitedforstoringstandardbasedhealthrecords. Thesecondapproachsuggestscapturingknowledge abouthierarchy(path)inaBLOB(binarylargeobjects). Variousstorageapproaches,suchasXMLand JSON,exploitBLOBtostoreopenEHRcomplaintdata. OneoftheapproachesutilizingBLOBinan RDBMSisNode+Path[8]. Node+PathapproachissimilartoEAV;bothusesemanticpathsasattribute names. Wangetal.[9]suggestanArchetypeRelationMapping(ARM)approachasanoptimization ofORM.ItproposestouseoneNSMtableperarchetype,withsomeadditionalmetadatatablesto supportschemaevolution. Eacharchetypeismappedtoonerelationaltableusingadefinedsetof mappingrules. Information 2018,9,2 6of30 1.1.3. StorageApproachesinRDBMSforSparseDataset Todealwithsparseness,manyeffortsaredoneatthephysicallayer[5,47,48],aswellasthelogical layer[1,10]ofRDBMS.VariousstorageapproachessuggestedinRDBMSliteraturearedetailedinthe followingsubsections. • PhysicalLevelModification Beckmannetal.[5]proposedamodificationofphysicallayer(termedasinterpretedattribute storageformat)ofroworientedRDBMS.Itreplacedthefixedlengthtuplebyvariablelengthtuple storingnon-nullattribute-valuepairsandassociatedlength. Thisapproachprovidesefficientstorage butdegradestheperformanceofpopulationqueries(duetovariablelengthtuples). Microsoftincluded‘SparseColumn’[47]functionalityinSQLServer2008andlaterversionsto mitigatetheeffectofsparseness. However,‘SparseColumn’fallsshortinfollowing: 1. ‘Sparse Column’ functionality is not applicable to many data types that are quite common nowadays,suchas‘String’,‘Timestamp’,and‘Geometry’,etc. 2. Noconstraintscanbeappliedto‘SparseColumns’. 3. Nodatacompressionispossiblefor‘SparseColumns’. 4. Copying data from one machine to another will result in a loss of the ‘Sparse Column’functionality. Inadditionto‘SparseColumn’,Microsoftintroducedcolumnstoreindex[49]inSQLServer2012 andlaterversiontoenhancetheperformanceofpopulationqueries. Also,manycolumnarRDBMSare incorporatingvariouscompressiontechniquestohandlesparseness[48]. However,alittlespaceis wastedevenaftercompression. Forexample,nullbitmapreservesonebitforeachnullvalue. Existingsolutions(discussedinthissub-section)dealwellwithsparsenessandalsoenhancethe performanceofqueries,butfallsshortincaseofschemaevolution. Asdiscussedbefore,withevolvingschema,theschemadesignerneedstoredesigntheERfor incorporatingnewlyevolvedattributes. WithanewgoodERdesign,correspondingchangesmust be reflected to the database schema, as well as the application. This incurs extra cost and the loss of stability. Therefore, it has been suggested to adopt a generic schema for storing standardized EHRs[12,13]whenconsideringfrequentschemaevolution. Partitionacross(PAX)[50]dividethen-arytableintomultiplepagesandeachpageisvertically partitionedincache. ThisenablestheimprovementinsearchefficiencyofOLTPqueriesbykeeping wholetupleincache. Simultaneously,OLAPperformanceisimprovedsincespatiallocalityofattribute dataisimprovedbyverticallypartitioning. Anotherapproach,termedasfracturedmirror[51],providestwodiskimagesofsamedataset. Eachdiskimagehastwosamefragmentsofthedataset,butdistinctphysicalorganization.Forinstance, Disk1willsavefragment1asn-arytableandfragment2asverticallypartitioneddataset. Whereas, Disk 2 will save fragment 1 as per vertically partitioned dataset and fragment 2 as n-ary table. Dependinguponthetypeofquery(OLTPorOLAP),thebestorganizationischosenbytheoptimizer. HYRISE[52]alsoworksinthedirectionofprovidingefficientstoragemechanism. Itprovides a storage hybrid architecture that inherits advantages of NSM and vertical partitioning. It creates variablelengthpartitionsofthewholedatabase. Eachpartitioncanbestoredeitherasn-arytableor verticallypartitionedtableaspertheunderlyingrequirements. Ifattributesareaccessedfrequently, thechoiceofstorageshouldbeverticalpartitioning,suchasincaseofOLAPqueries. Foraccessing rowspecificdata(OLTPscenario),NSMisopted. Pinneckeetal.[53]haspresentedasurveyofvariousstorageapproaches,includingPAX,fractured mirrors,andHYRISE,whichdonotdealwithsparseness;however,theyprovidestoragewithimproved searchefficiency. • LogicalLevelModification Information 2018,9,2 7of30 EAVisawidelyadoptedlogicallevelapproach. However,searchinefficiencyofEAVdemands for other storage models, or enhancement to existing EAV structure. An alternate approach for storing sparse dataset is to create one binary table corresponding to each attribute of a relational table [10,54]. The first column of the binary table contains primary key and the second column definesthecorrespondingattribute. Itimprovestheperformanceofpopulationqueries. However,the performancesofpatient-specificqueriesworsen. Patientspecificqueriesextractentityspecificdata thatneedstobescannedinallthebinarytablesofdatabaseirrespectiveofthefactthatonlyafew binarytablesareapplicableforunderlyingentity. Sparsedatasetexhibitsaspecialcharacteristicthatentitiesarelikelytohavethesamesubsetof non-nullattributes[46]. Theelementsofthissubsetaretermedasco-occurringattributes. Toimprove theperformanceofpatientspecificqueries,abetterapproach(thanconstructingbinarytables)isto groupco-occurringattributesinonerelationaltable. Informationregardingco-occurringattributes needstobediscoveredbeforehand. Clusteringalgorithms,suchasK-NearestNeighborcanbeapplied toidentifyco-occurringattributes. Baumgartneretal.[55]presentedanefficienttechnique,termedas SURFING(subspacesrelevantforclustering),whichidentifiesclusterbasedonrelevance. Relevance is identified based on interestingness of a subspace using the k-nearest neighbor distances of the objects. Irrespectiveoftheefficiencyofclusteringalgorithm,astheschemaevolves,theco-occurring attributesmayalsoevolve,whichinturn,mayrequirerebuildingofanapplicationbuiltontheprevious schema(asincaseofgoodERdesign). Currentresearchalsoconsidersco-occurringattributesfor thepartitioningofdataindifferenttables. However,partitionsintheproposedapproachfollowEAV modelandinformationregardingco-occurringattributesareextractedfromtheopenEHRarchetype definition(asdetailedinnextsection). • WideTableApproach Besidesphysicallevelandlogicallevelmodifications, apopularRDBMSapproachistostore a large number of entities belonging to the same entity set in one wide table [4]. This approach reliesonthecompressiontechniquesthatareofferedatphysicallayerbytheunderlyingcolumnar RDBMS.Thewidetableapproacheliminatestheinvolvementofaschemadesignerasschemaevolves. However,schemaevolutionisstillexpensiveasmodificationsneedtobereflectedinthecorresponding informationsystem. Anotherpointofconcernrelatedwithwidetableisqueryingdatasetinvolvinga hugenumberofattributes,sincetheusercannotrememberthousandsofattributesthatareinvolved inthedataset. Moreover,adrop-downmenuisnotanefficientoptionbecausescrollingthousands ofattributesdonotseemtobeafeasiblesolution. Thus,Chuetal.[4]proposedakeywordsearch mechanismtoprovidetheuserwithpotentialdesiredattributeset. Keywordsearching[56]isgoodas theuserdoesnotneedtorememberalltheattributes. However,itisnotalwayspossibletoretrieve onlydesiredattributes(additionalattributesarealsoextractedwithmatchingkeyword). • MaterializedViews Materializedviews[57]areveryadvantageousforstoringresultsofqueriestoavoidlongrunning calculationseverytimethataqueryisexecuted. Atthephysicalstoragelevel, materializedviews behave like indexes. It is easy to add attributes to an NSM table without making changes to the materializedviews. Aslongastheusersaccessdatathroughviews,therelationshipsbetweenthe tablescanbechangedwithoutdisruptingqueries.Thus,agoodERcanbefollowedandtheunderlying tables that are physically stored in the database do not need to be sparse. Views can provide an efficientsolutioninsituationswhereunderlyingqueriesarestatic(notchanged). However, inthe healthcaredomain,parameters(attributes)evolvefrequently.Thenewlyevolvedparametersneedtobe accommodatedinanexistingdatabaseandmustbeinquiredtoreflectpatients’situation. Thus,having astaticviewcannotresolvetheissueoffrequentevolution. Information 2018,9,2 8of30 1.1.4. PerformingAnalyticalOperations Databasesaredesignedtosupporttwocategoriesofoperations,i.e.,transactional(thatusually demandspatientspecificdata)andanalytical(thatrequirespopulationdata). Apopulartool,termed asInformaticsforIntegratingBiologyandtheBeside(i2b2),usesEAVforthepurposeofdatastorage toperformanalyticalqueries[58]. I2b2isaself-servicetoolthathasbeendesigned(andbecomingade factostandard)specificallyforpatients’cohortidentification. Itallowstoperformapopulation-wide searchtoidentifytheamountofpatientexistingasperastudy-specificcriteriaforfeasibilityanalysis oftheunderlyingstudy[58]. Toprovidefunctionalityofcohortidentificationandfeasibilityanalysis, i2b2performsanalyticaloperationsonpopulationdata. However,itlackssupportforqueryingdata relatedtospecificpatient. Thesystemproposedincurrentresearch(termedas2DEAV)supports extractionofpatientspecific,aswellaspopulation,datawithabetterspeedthanthatofEAV. I2b2 implements an EAV based star schema that enables integration of healthcare data from disparatesources. However,theuseofEAVini2b2hindersconstraintdefinition,andthus,i2b2is completelydependentupontheindividualcontributingsystemsfortheimplicationofconstraints[59]. In contrast, openEHR provides a dual layer modelling approach that segregates the information (inreferencemodel)fromknowledge(inarchetypemodel). Archetypesdefinedataqualityconstraints tobeplacedontheindividualsystemandthecontentofrecordentries[40]. Individualsystemscan useopenEHRarchetypesforimplementingconstraintdefinitions. Thisconstraineddatasetcanbe migratedtoi2b2forpatients’cohortidentification. Haarbrandtetal.[60]proposedanapproachto automatetheprocessofpopulatingi2b2clinicaldatawarehousewithopenEHRcomplaintdataset. Infuture,authorswilltrytoexport2DEAVcomplaintdatatoi2b2(byfollowingasimilarapproachas suggestedbyHaarbrandtetal.) forfurtherenhancingthecapabilitiesof2DEAV. 1.2. ObjectiveofThisResearch Objectivesofcurrentresearcharetoprovide(1)betteraccessspeed;(2)adherencetostandards; (3)capabilitytoaccommodatenewknowledgewithoutmodifyingexistingschemaandinformation system;(4)lessstoragewithnosparseness;and,(5)easeofad-hocquery. Currentresearchaimsto buildanEAVstylemodelingapproach,termedas2DEAV,thatcanbeusedforstorageofhighlysparse andevolvingdatabelongingtohealthcareaswellasotherdomains. 2. Method Ourstudyproposesamodifiedentityattributevaluestoragemodelnamed2DEAV.Itisespecially designedforstoringheterogeneousandarchetype-baseddata. InadditiontothecapabilitiesofEAV, suchashandlingsparseness,genericstructure,andfrequentschemaevolution,2DEAValsoenhances searchingandqueryingcapabilitieswithsupportforqueryingdatarelatedtothedesiredtemplates. 2.1. DesignandImplementationof2DEAV Foundationsof2DEAVliesinpartitioningdatastoredinasingleEAVtableintomultipleEAV tablesusingtwodimensions. Theparameterschosenforpartitioningaredatasemantics,improving spatiallocalityofco-occurringattributesanddifferenttypesofdata. Thefirstdimensionchosenis archetype(basedondatasemanticsandimprovingspatiallocality),andtheseconddimensionchosen isdatatype(basedonstoringheterogeneousdata). Hence,thename2DEAV. Partitioningbasedondatasemanticsplaysanimportantroleinimprovingsearchefficiency[61]. Thus, 2D EAV utilizes the data semantics of medical concepts that are defined in an archetype to createpartitionsofdata. Knowledgerepresentationofclinicalconceptsisthrougharchetypesthat enable semantic interoperability of heterogeneous systems [40]. Another motive behind choosing archetypeasadimensionforpartitioningisimprovingspatiallocalityofco-occurringattributes[46]. Archetypecorrespondstotheparametersthatbelongtotheunderlyingmedicalconcept(whichtends toberecordedtogether,i.e.,co-occurringattributes). Thus,partitioningbasedonarchetypesenables Information 2017, 9, 2 9 of 30 archetype (based on data semantics and improving spatial locality), and the second dimension chosen is data type (based on storing heterogeneous data). Hence, the name 2D EAV. Partitioning based on data semantics plays an important role in improving search efficiency [61]. Thus, 2D EAV utilizes the data semantics of medical concepts that are defined in an archetype to create partitions of data. Knowledge representation of clinical concepts is through archetypes that enable semantic interoperability of heterogeneous systems [40]. Another motive behind choosing archetype as a dimension for partitioning is improving spatial locality of co-occurring attributes [46]. Information 2018,9,2 9of30 Archetype corresponds to the parameters that belong to the underlying medical concept (which tends to be recorded together, i.e., co-occurring attributes). Thus, partitioning based on archetypes penreasbelnecse porfesceon-occec oufr rcion-gocdcautrariinngo dneattaa binle o.nAed toapbtlieo. nAodfoaprtcihoent yopf easrcehliemtyinpaetse etlhimeoinvaetreh tehaed oovfearphpelaydin ogf calpupsltyeriningg calulgsoterritinhgm aslfgoorreitxhtmrasc tfionrg ecxotr-oaccctiunrgr icnog-oacttcruibrruintegs adtetrtaibilust.es details. ooppeennEEHHRR ffoolllloowwss aa rriiggoorroouuss aarrcchheettyyppee ddeefifinniittiioonn pprroocceessss tthhuuss aann aarrcchheettyyppee iiss ccoonnssiiddeerreedd aass aann aauutthheennttiiccaatteedd ssttaannddaarrdd ddeefifinniittiioonn ooff aa mmeeddiiccaall ccoonncceepptt [[4411]].. AAnnyy eevvoolluuttiioonn iinn kknnoowwlleeddggee iiss rreelleeaasseedd aass aa vveerrssiioonneedd aarrcchheettyyppee.. TThhuuss,, cchhaannggeess iinn eexxiissttiinngg ccoo--ooccccuurrrriinngg ggrroouupp ooff aattttrriibbuutteess ccaann bbee hhaannddlleedd tthhrroouugghh 22DD EEAAVV wwiitthhoouutt mmaakkiinngg aannyy cchhaannggeess ((oorr rreebbuuiillddiinngg)) iinn tthhee sscchheemmaa aanndd eexxiissttiinngg hheeaalltthhccaarree aapppplliiccaattiioonn.. DDeettaaiill aabboouutt vveerrssiioonn hhaannddlliinngg iinn 22DD EEAAVV iiss eexxppllaaiinneedd iinn SSeeccttiioonn 44..33.. 22DD EEAAVVs egsreeggraetgeastethse tdhaet adbaatsae dboansedda taonty pdeast(at hteyspeecso n(dthdei mseencosinodn )dtoimsuepnpsioornt)h ettoe rosguepnpeoitryt. Ihnettehreoagbesneenitcye. ofInp atrhtiet ioanbisnegnbcae seodf opnadrtaittaiotnyipnegs ,bmauseltdip loenv adluaetac otlyupmens,s cmorurletsipploen dvianlgueto cdoiflfuemrennst dcoartaretsyppoensd(isnugc htoa sd,ivffaelrueen_ti ndta,tvaa tlyupe_etse (xstu,cvha lause,_ vbaoloulee_ainn)t,a vrealrueqe_utierxedt, ivnaleuaec_hbaoroclheeatny)p aertea breleq.uIinresdu cihn aeascche naarrcihoe,toynpley toanbelev. aInlu seuccohlu am sncewnailrlioco, nontaliyn otnhee evnatlruye acnodluamllno fwtihlle cootnhtearisnw thilel beentnruy lal,nrdes aulllt ionfg thine sopthaersres nwesisll. Tbhei snmulol,t irveastuedltiunsg toinp asprtaitrisoennethsse. aTrchhise tympoetitvaabtleedc ourrse stop opnadritnitgiotno dthatea atyrcpheest(yfpoell otwabelde bcoyrtrheespVoanludeincgo ltuo mdantao ftyupnedse (rfloylilnogwteadb lbeys) .the Value column of underlying tables). LLeett AA bbee aann aarrcchheetytyppeew witihtha tattrtirbiubutetseso fotfh trhereede idstiisnticntcdta dtaattay ptyeps,eis.e, .i,.eD.,T D1,TD1,T D2,Ta2n, danDdT 3D.TT3h. eTsheet osefta ottfr iabtutrtiebsu(tseasy ,(swayit,h weliethm eelnetmsean1,tsa 2a1.,. a.2, a…8), oaf8A) oifs Adi visi ddeivdiidnetdo tihnrtoee thsurebes estusbcsoertrse scpoorrnedsipnogntdointhgr etoe dthisrteien cdtisdtaintactt ydpaetas t(yDpTe1s, (DDTT21,, aDnTd2D, aTn3d). DT3). AAtt11 == {{aa11,, aa22,, aa33}} AAtt22 == {{aa44,, aa55,, aa66,, aa77}} AAtt33 == {{aa88}} EEaacchh aattttrriibbuuttee ssuubbsseett iiss mmaappppeedd ttoo oonnee AArrcchheettyyppee ttaabbllee,, aass sshhoowwnn iinn FFiigguurree 22.. FFiigguurree 22.. DDaattaa ppaarrttiittiioonniinngg mmeecchhaanniissmm aaccccoorrddiinngg ttoo ddaattaa ttyyppee.. For example, blood pressure archetype constitutes five attributes, with two distinct data types, Forexample,bloodpressurearchetypeconstitutesfiveattributes,withtwodistinctdatatypes,i.e., i.e., ‘Quantity’ (four attributes) and ‘Text’ (one attribute). Thus, two archetype tables are defined. ‘Quantity’(fourattributes)and‘Text’(oneattribute). Thus,twoarchetypetablesaredefined. Inthis In this example, the first archetype table (corresponding to ‘Quantity’ data type) contains four example,thefirstarchetypetable(correspondingto‘Quantity’datatype)containsfourattributesonly. attributes only. The second archetype table contains one attribute (corresponding to ‘Text’ data type) Thesecondarchetypetablecontainsoneattribute(correspondingto‘Text’datatype)only. Theprocess only. The process is repeated for all archetypes involved in building the database of healthcare isrepeatedforallarchetypesinvolvedinbuildingthedatabaseofhealthcareapplication. application. 2DEAVsegregatesthedatabasedondatatypes(theseconddimension)tosupportheterogeneity. 2D EAV segregates the data based on data types (the second dimension) to support Inabsenceofpartitioningbasedondatatypes,multiplevaluecolumnscorrespondingtodifferentdata heterogeneity. In absence of partitioning based on data types, multiple value columns typesarerequiredineacharchetypetable.Insuchascenario,onlyonevaluecolumnwillcontaintheentry corresponding to different data types are required in each archetype table. In such a scenario, only andalloftheotherswillbenull,resultinginsparseness.Thismotivatedustopartitionarchetypetable one value column will contain the entry and all of the others will be null, resulting in sparseness. correspondingtodatatypes(asreflectedinValuecolumnofunderlyingtables).Presently,weconsider This motivated us to partition archetype table corresponding to data types (as reflected in Value onlyfourbasicdatatypesforthepurposeofdemonstration:Integer,String,Real,andBoolean.Thesetof column of underlying tables). Presently, we consider only four basic data types for the purpose of datatypescanbeenhancedbysimplyaddingtablesthatarerelatedtothedesireddatatypes. EacharchetypetableisuniquelytermedasaconcatenatedstringofArchetype_ID,anunderscore ( “_”), and its corresponding Data_Type. openEHR provides a unique identification code to each archetype. However, this unique code is a long string that consumes more space (when stored repeatedly)andintroducesadelayindataprocessing(needstobede-serializedattimeofaccess). Thus,for2DEAV,thelongstringidentificationcodeallottedbyopenEHRismappedtoanewunique identificationcodethroughamappingtable. ThismappedcodeservesasArchetype_ID. Following a generic approach (EAV) for each archetype table, facilitates the addition of any number of attributes to existing information system, and the freedom from sparseness. As new Information 2017, 9, 2 10 of 30 demonstration: Integer, String, Real, and Boolean. The set of data types can be enhanced by simply adding tables that are related to the desired data types. Each archetype table is uniquely termed as a concatenated string of Archetype_ID, an underscore (“_”), and its corresponding Data_Type. openEHR provides a unique identification code to each archetype. However, this unique code is a long string that consumes more space (when stored repeatedly) and introduces a delay in data processing (needs to be de-serialized at time of access). Thus, for 2D EAV, the long string identification code allotted by openEHR is mapped to a new unique identification code through a mapping table. This mapped code serves as Archetype_ID. Following a generic approach (EAV) for each archetype table, facilitates the addition of any Information 2018,9,2 10of30 number of attributes to existing information system, and the freedom from sparseness. As new archetypes are added to the archetype repository, schema evolves automatically in 2D EAV storage archetypesareaddedtothearchetyperepository,schemaevolvesautomaticallyin2DEAVstorage system and requires no amendments to the definition of existing information system. systemandrequiresnoamendmentstothedefinitionofexistinginformationsystem. Partitioning of data into multiple tables (as per 2D EAV) results in many tables. Archetypes are Partitioningofdataintomultipletables(asper2DEAV)resultsinmanytables. Archetypesare advantageous because 10–20 basic archetypes are sufficient to build the core of a health application advantageousbecause10–20basicarchetypesaresufficienttobuildthecoreofahealthapplication[62–64]. [62–64]. Around 100 archetypes can constitute a primary care electronic health record [63,64]. Around100archetypescanconstituteaprimarycareelectronichealthrecord[63,64].Similarly,around Similarly, around 2000 archetypes can constitute hospital EHRs, as compared to more than 400,000 2000 archetypes can constitute hospital EHRs, as compared to more than 400,000 active concepts in active concepts in SNOMED CT [42]. So, the resultant 2D EAV storage can have thousands of tables. SNOMEDCT[42]. So,theresultant2DEAVstoragecanhavethousandsoftables. However,thedata However, the data that needs to be accessed will be limited to a few tables. This happens because a thatneedstobeaccessedwillbelimitedtoafewtables.Thishappensbecauseapatientdatarecordings patient data recordings will generally correspond to 10–20 basic archetypes (as required to build the willgenerallycorrespondto10–20basicarchetypes(asrequiredtobuildthecoreofahealthapplication). core of a health application). Other recorded parameters (if any) will contain a null value (not stored Otherrecordedparameters(ifany)willcontainanullvalue(notstoredinEAV). in EAV). InsimpleEAV,anexhaustivesearchtocompletedatamightberequiredforextractingdesired In simple EAV, an exhaustive search to complete data might be required for extracting desired data. Incontrast,2DEAVrestrictsthesearchfordesireddatatothetablescontainingit. Toenablethis data. In contrast, 2D EAV restricts the search for desired data to the tables containing it. To enable restriction,2DEAViscomplementedwithmetadatasupportastwotables,namely: Mastertableand this restriction, 2D EAV is complemented with metadata support as two tables, namely: Master table Templatetable(asshowninFigure3). and Template table (as shown in Figure 3). Figure 3. Modified entity attribute value storage model. Figure3.Modifiedentityattributevaluestoragemodel. The Master table features four columns: ID, Patient_ID, Session, and Template_ID. The TheMastertablefeaturesfourcolumns: ID,Patient_ID,Session,andTemplate_ID.TheTemplate Template table features five columns: Template_ID, Archetype_ID, Data_Type, Attribute_ID, and tablefeaturesfivecolumns:Template_ID,Archetype_ID,Data_Type,Attribute_ID,andAttribute_Name. Attribute_Name. 2D EAV stores a single EAV table as a collection of disjoint EAV partitions. Each 2DEAVstoresasingleEAVtableasacollectionofdisjointEAVpartitions. Eachpartition(termedas partition (termed as Archetype table) corresponds to a distinct data type attributes of an archetype. Archetypetable)correspondstoadistinctdatatypeattributesofanarchetype.Variousarchetypetables Various archetype tables features three columns: ID, Attribute_ID, and Value. featuresthreecolumns:ID,Attribute_ID,andValue. • Master Table: The Master table is designed with the aim of uniquely identifying a patient’s admittancetothehospital. TheMastertablefollowstherelationalapproach,sinceitstoresdata thatcontainsnonullvaluesandhaveafixedschema. IDcolumnistheprimarykeyoftheMaster tablethatstoresauto-generatedsequentialnumberstoidentifyeachentryintheMastertable uniquely. Patient_IDisuniqueforaparticularpatient,butitcannotserveasacandidatekeyina Mastertablesinceapatientcanhavemultipleadmittancestoahospital,andthus,manyentries intheMastertable. SessionisrecordedtosupportatemporalbehaviorofstandardizedEHRs. EachentryintheSessioncolumnconsistsofadate(usingddmmyyyyformat)followedbytime

Description:
approach [36], has been introduced in the Synapses project [37] and NoSQL databases are gaining popularity as a storage option for highly sparse and internet (such as, if systolic pressure ranges between 120 to 139 and Reason behind the larger size of EAV is the presence of multiple 'Value'.
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.