Ryanetal.BMCMedicalInformaticsandDecisionMaking2012,12:3 http://www.biomedcentral.com/1472-6947/12/3 RESEARCH ARTICLE Open Access Use of name recognition software, census data and multiple imputation to predict missing data on ethnicity: application to cancer registry records Ronan Ryan1*, Sally Vernon2, Gill Lawrence3 and Sue Wilson4 Abstract Background: Information on ethnicity is commonly used by health services and researchers to plan services, ensure equality of access, and for epidemiological studies. In common with other important demographic and clinical data it is often incompletely recorded. This paper presents a method for imputing missing data on the ethnicity of cancer patients, developed for a regional cancer registry in the UK. Methods: Routine records from cancer screening services, name recognition software (Nam Pehchan and Onomap), 2001 national Census data, and multiple imputation were used to predict the ethnicity of the 23% of cases that were still missing following linkage with self-reported ethnicity from inpatient hospital records. Results: The name recognition software were good predictors of ethnicity for South Asian cancer cases when compared with data on ethnicity derived from hospital inpatient records, especially when combined (sensitivity 90.5%; specificity 99.9%; PPV 93.3%). Onomap was a poor predictor of ethnicity for other minority ethnic groups (sensitivity 4.4% for Black cases and 0.0% for Chinese/Other ethnic groups). Area-based data derived from the national Census was also a poor predictor non-White ethnicity (sensitivity: South Asian 7.4%; Black 2.3%; Chinese/ Other 0.0%; Mixed 0.0%). Conclusions: Currently, neither method for assigning individuals to an ethnic group (name recognition and ethnic distribution of area of residence) performs well across all ethnic groups. We recommend further development of name recognition applications and the identification of additional methods for predicting ethnicity to improve their precision and accuracy for comparisons of health outcomes. However, real improvements can only come from better recording of ethnicity by health services. Background the ethnicity of individuals; the use of Census data This paper presents a method for imputing missing data based on area of residence to predict the ethnicity of on the ethnicity of cancer patients, developed for a individuals; and finally, the use of multiple imputation regional cancer registry in the UK. It implements exist- (MI) to make an allowance for the use of these predic- ing approaches in a novel situation and evaluates their tors in subsequent statistical analyses. The method has utility. It combines four differing approaches to dealing applications beyond cancer registries, and the results with missing data of this type: the use of an additional presented below are relevant to all organisations and source of self-reported ethnicity to replace the missing researchersthathaveincompleteinformationontheeth- data; the use of name recognition software to predict nic group of individuals who have access to additional datathatmayhelppredictethnicitywhenmissing. TheWestMidlandsCancerIntelligenceUnit(WMCIU) *Correspondence:[email protected] is a regional cancer registry covering a population of 1PublicHealth,EpidemiologyandBiostatistics,UniversityofBirmingham, approximately5.3million.Theregistrycollectsinformation BirminghamB152TTUK Fulllistofauthorinformationisavailableattheendofthearticle ©2012Ryanetal;licenseeBioMedCentralLtd.ThisisanOpenAccessarticledistributedunderthetermsoftheCreativeCommons AttributionLicense(http://creativecommons.org/licenses/by/2.0),whichpermitsunrestricteduse,distribution,andreproductionin anymedium,providedtheoriginalworkisproperlycited. Ryanetal.BMCMedicalInformaticsandDecisionMaking2012,12:3 Page2of8 http://www.biomedcentral.com/1472-6947/12/3 about new cases of cancer and produces statistics about groups, or the original ethnic group of individuals who incidence,prevalence,survivalandmortality:theavailabil- adopttheirpartner’ssurnamewherethepartnerisamem- ityofinformationonthesociodemographiccharacteristics, berofanotherethnicgroup.Anotherapproachistheuse includingethnicity,ofcancercasesisimportantforservice of census information on the ethnic distribution of the planning, ensuringequalaccessto these services,and for area inwhichthe caselived.Examplesof thisare astudy epidemiological studies. The mainsource of information ontheuptakeofbreastcancerscreeningservicesinLon- ontheethnicityofcancercasesfortheWMCIUislinkage don,andthedevelopmentofariskcalculatorforcoronary tothenationalHospitalEpisodeStatisticsdatabase(HES), heart disease based on data from a large set of general whichroutinelycollectstheself-reportedethnicityofhos- practices [12,13]. This approach has the advantage of pital patients [1]. However, HES currently only provides beingeasytoimplementifthepostcode oftheindividual informationonpatientsadmittedtohospital,anddoesnot is available, but relies on the assumption that the ethnic includepatientswhowereassessedortreatedinsecondary distributionofaCensusarea(approximately1,500people) care,butnotadmitted,nordoesitincludeinformationon isanaccuratepredictoroftheethnicityofindividuals. patientsseenprivatelyoutsidetheNationalHealthService The use of MI, such as that implemented by Royston (NHS).Nationally,only80%ofcancercaseshaveethnicity andcolleaguesinthestatisticalsoftwarepackageStata,has dataavailablefromHES[2].Thisrelianceonlinkagewith the potential to create a complete dataset that combines hospitaladmissionshasimplicationsforcancerincidence, the predictions generated by the above methods, and prevalenceandsurvival.Theuseofcompletecaseanalyses makes an allowance for the imprecision of these predic- inthissituationhasthepotentialtocausebias:forexam- tionsthatiscarriedthroughtothefinalstatisticalanalyses ple,casesofprostatecancerswhoarenottreated(’watchful [14].Itrequirestheusertohaveanaccurateunderstand- waiting’)mayhavelongersurvivalthanthosewhosecancer ing of the reasons why the data are missing (the missing ismore advancedand are therefore admitted to hospital. datamechanism),goodpredictorsofthevalueofthemiss- The exclusion of the former from survival analyses will ingdata,andassumesthatthedataareotherwisemissing tend to underestimate survival overall,and, ifethnicity is at random (MAR). It is, therefore, a combined approach associated with treatment type, may obscure any differ- which aims to maximise accuracy where missing values encesinsurvivalbetweenethnicgroups.Similarly,cancer arepredictedandtoadjusttheprecisionofanyestimates caseswhoareknowntotheWMCIUthroughdeathcertifi- derivedfromthesepredictions(e.g.bywideningtheconfi- cationonly(DCO) (theonlyinformationthey holdabout denceintervalsonestimatesofcancerincidence). thecaseisbasedonadeathcertificate)havelesscomplete We will present information on the sensitivity and linkagewithHES.Anycomparisonofcancerincidenceby specificity of each of these methods, and describe how ethnicgrouprestrictedtocompletecaseshasthepotential these multiple sources were combined in a dataset that toobscureanydifferencesbetweenethnicgroupsifpeople can be used for the estimation of cancer incidence and from ethnic minorities are over represented among the survival by ethnic group. DCOcases. Therearealternativeapproachestodealingwithmissing Methods informationthatallowallcasestoberetainedforanalysis. Population The simplest of these is to link with additional external 111694cancercasesnormallyresidentintheWestMid- information sources that record ethnicity, such as the landsregionoftheUKanddiagnosedbetween1January NationalBreastCancerScreeningService(NBSS),butthe 2001and31December2007wereincludedinthecohort populationcoverage,completenessandaccuracy ofthese forthisproject.Caseswerelimitedtothefivemostcom- data may also be limited. A second commonly used mon cancer sites (breast, upper GI, lower GI, prostate, approachisto uselists of namesthat areassociatedwith andlung)astherewouldbetoofewcasesfromthenon- particular ethnic groups. Name recognition software Whiteethnicgrouptoallowreliablecomparisonsofinci- packagessuchasNamPehchanandSANGRA havebeen denceandsurvivalforthelesscommoncancers. usedinavarietyofsettingstoidentifypeoplewithSouth Asian heritage [3-6]. More recently, another package Ethnic groups called Onomap has become available: this attempts to The five ethnic groupings used were: White, South identify people from many ethnic groups [7,8]. Each of Asian, Black, Chinese/Other, and Mixed. These broad thesepackagesreliesonindividualshavinganamethatis groupings coincide with those used in official national strongly associated with a particular ethnic group and statistics [15]. theirsensitivityandspecificityisknowntovaryfromset- ting to setting [9-11]. None of them can identify people Availability of existing data on ethnicity whowoulddescribethemselves as havinga mixedethni- Data on ethnicity were available for 85 506/111 694 city,individualswhosesurnamesarenotspecifictoethnic (77%) of cancer cases from HES, although some Ryanetal.BMCMedicalInformaticsandDecisionMaking2012,12:3 Page3of8 http://www.biomedcentral.com/1472-6947/12/3 individuals had more than one ethnic group recorded. ordertoassesstheabilityofeachtoidentifytheethnicity Where this occurred (1506/85 506 (1.8%)), we used the ofcancercases. most commonly recorded ethnic group, in line with the method recommended by Downing and colleagues [16]. Prediction of ethnicity using area-basedCensus data This approach is believed to be appropriate as it uses Cases were assigned to the Census area (lower layer most information. We set ethnic group to missing for super output area (LSOA): average size 1500 persons) cases with more than one ethnic group recorded, but associated with their postcode or residence, and linked without a ‘most common’ ethnic group (154/85 506 to a dataset with the ethnic distribution of each LSOA (0.18%) of the complete cases), again in line with the in the 2001 national Census. method used by Downing. We then tabulated the char- acteristics of the cohort, and used a univariate chi- Full multiple imputation model square analysis to identify demographic and clinical fac- Thelaststage inprocessing,theimputationofthe miss- tors associated with missing ethnicity (the missing data ing ethnicity using the existing variables shown to be mechanism) [17,18]. associated with missing ethnicity (Table 1) and external information derived from the above sources was carried Additional source of ethnicity data out in Stata using the MI package ICE [14]. The linked The 28 795 breast cancer cases from the cohort were NBSS datawasuseddirectly to replacethe ethnicgroup linked to data held by the eight breast cancer screening ofcasesnotalreadyknownfromlinkagewithHES,rather services (NBSS) in the region using their NHS number. than as a separate predictor in the imputation model. We assessed the value of this additional information by Where the case’s surname at birth was available (from comparing sensitivity, specificity and positive predictive their death certificate) the Onomap and Nam Pehchan value of the ethnicity recorded in the NBSS using the resultsforthatnamewereusedinplaceoftheresultsfor HES dataset as the gold standard, where both were all known names in the imputation model, as name at available. birth may be a more accurate reflection of ethnicity for individualswhohavechangedtheirnamefollowingmar- Prediction of ethnicity using name recognition software riage.Inaddition,asweintendedtousethedataforcan- Twonamerecognitionapplicationswereavailableforuse cer-specific and all-cause mortality survival analyses, we in this project: Nam Pehchan and Onomap. Nam Peh- included these survival outcomes as covariates in the chan was used to identify people with South Asian imputationmodel, alongwith the time toeach outcome names, and Onomap was used to identify people with (the Nelson-Aalen estimate of the cumulative hazard names associated with White, South Asian, Black and function)[14].Thenumberofimputeddatasetswascho- Chinese/Other ethnic groups. Asearly use ofNam Peh- sen conservatively: one imputed dataset per 1% of cases chanwiththiscohortshowedthatitincludedforenames with any missing data. Missing ethnicity was imputed which were common among other ethnic groups (e.g. using multinomial logistic regression within the ICE ‘Mona’),wedecidedtorunitonforenamesandsurnames package,andthedistributionofimputedvalueswastabu- separately,ratherthanthedefaultapproachwhichwasto lated for comparison with the observed (complete case) run it on all forenames and surnames combined. This data. The sensitivity, specificity, and positive predictive allowedmatchedsurnamestocarryagreaterweightthan value (PPV) of the full multinomial logistic model was matchedforenamesintheMIprocess. compared with the ethnic group recorded in HES in Since Onomap only makes use of a single forename, ordertoassessitsability toidentifytheethnicity ofcan- onlythefirstforenamewasusedwhentheapplicationwas cer cases. The model was developed on a randomly run.However,ascasescouldhavemorethanonesurname selected 50% sample of the 85352 cases whose ethnicity associatedwiththeirWMCIUrecord,theapplicationwas was recorded in the HES dataset. The remaining 50% of runwitheachcombinationofforenameandsurnamefor cases were used to validate the model and derive the each case. If Onomap assigned a case to more than one aboveestimates.The predictorsusedinthemodelwere: broadethnicgroup,theirmultipleresultswerereplacedby ethnicity derived from name recognition software; Cen- asingleresultaccordingtothefollowingorder ofprefer- sus estimates of ethnic distribution of population; num- ence: Chinese/Other, Black, South Asian, White. This berofhospitaladmissions;yearofdiagnosis;patientseen orderingcorrespondswiththerelativesizeofeachgroup outside the NHS (yes/no); screen-detected cancer (yes/ within the regional population, with preference given to no); death certificate only cancer registration (yes/no); thelesscommonethnicgroups.Thesensitivity,specificity, cancertreatmenttype(surgery/radiotherapy/chemother- andpositivepredictivevalue(PPV)ofthetwoapplications apy); deprivation score; gender; age at diagnosis; cancer wascompared withthe ethnic group recorded inHES in site; and death during follow-up period (all-cause and Ryanetal.BMCMedicalInformaticsandDecisionMaking2012,12:3 Page4of8 http://www.biomedcentral.com/1472-6947/12/3 Table 1Characteristics ofthe ProjectCohort Cases Caseswithmissingethnicityfollowing Chi2statistic linkagewithHESrecords(%ofcases) (P-value) Site LowerGI 24446 4112(17) Breast 28795 6029(21) Lung 24060 5660(24) Prostate 23716 8814(37) UpperGI 10677 1727(16) 3500(p<0.001) Yearofdiagnosis 2001 15102 4118(27) 2002 15523 3840(25) 2003 15731 3772(24) 2004 16162 3681(23) 2005 16317 3632(22) 2006 16458 3470(21) 2007 16401 3829(23) 206(p<0.001) Deprivation 1(mostdeprived) 26738 5149(19) (IncomeDomainof 2 22104 4856(22) IndexofMultiple 3 22759 5425(24) Deprivation2007) 4 22465 5942(26) 5(leastdeprived) 17628 4970(28) 6300(p<0.001) Age <40 1771 251(14) 40-49 5622 892(16) 50-59 15338 2933(19) 60-69 27759 5924(21) 70-79 35258 8538(24) 80+ 25946 7804(30) 1100(p<0.001) Sex Male 59592 15454(26) Female 52102 10888(21) 391(p<0.001) DeathCertificate No 106217 23577(22) Onlyregistration Yes 5477 2765(50) 2300(p<0.001) Everseenprivately No 106566 23113(22) (cancerwasdiagnosedortreatedoutsidethefree Yes 5128 3229(63) 4600(p<0.001) NationalHealthServiceatleastononeoccasion) Surgery No 58875 18344(31) Yes 52819 7998(15) 4000(p<0.001) Radiotherapy No 73520 19009(26) Yes 38174 7333(19) 616(p<0.001) Chemotherapy No 93778 24650(26) Yes 17916 1692(9) 2400(p<0.001) Screendetected No 22900 5012(22) breastcancer* Yes 5895 1017(17) 61(p<0.001) HES-linked No 19694 19694(100) Yes 92000 6648(7) Numberofadmissions 0 19694 19694(100) (includesnon-canceradmissions) 1 8012 2071(26) 2 10261 1414(14) 3 10523 985(9) 4 9332 562(6) 5+ 53872 1616(3) 6300(p<0.001)** *Comparisonofbreastcancercaseswhowereandwerenotdetectedbypopulationscreening. **Chi-squaretestexcludescaseswithnoadmissionsas,bydefinition,nonehaveethnicityrecorded. Ryanetal.BMCMedicalInformaticsandDecisionMaking2012,12:3 Page5of8 http://www.biomedcentral.com/1472-6947/12/3 duetoprimarycancerseparately)andtimetodeath/cen- The sensitivity of Onomap is high for White and South soring(Nelson-Aalencumulativehazard). Asianethnicgroups(99.8%and82.1%),butlowforBlack andChinese/Othergroups(4.4%and0.0%).Thesensitiv- Research governance ityofNamPehchanwaslowerthatofOnomapforSouth The project did not require separate ethical approval as Asiancases(71.1%and82.1%),butwhenbothwerecom- it was commissioned by, and carried out in collaboration bined,sensitivitywashigherthaneachindividualapplica- with, the regional cancer registry. Cancer registries have tion (90.5%). A total of 14 615 cases had their name at legal support to collect data relating to cancer under birthrecordedontheirdeathcertificate. Section 251 of the NHS Act 2006 (and formerly under Table 5 shows the sensitivity, specificity and PPV of Section 60 of the Health and Social Care Act 2001). 2001 national Census data on ethnicity as a predictor of [http://www.ukcancassoc.ismysite.co.uk/content/legal- the ethnic group of individual cases. The sensitivity of background#S251]. Census data for the White ethnic group is high (99.3%), butverylowforallotherethnicgroups(lessthan7.4%for Results South Asian cases, 2.3% for Black cases, and 0% for the The completeness of information on the ethnicity of remainingtwogroups). cancer cases following linkage with HES varied signifi- Theethnicityofcasesthatweremissingfollowinglink- cantly (P < 0.001 in each case) by the demographic and age with the HES and NBSS datasets was imputed in clinical factors listed in Table 1. Stata usingICE withanimputationmodel thatincluded The value of linkage with breast cancer screening ser- the variables significantly associated with missingness vices(NBSS)informationonethnicityisshowninTable2 (Table1),thepredictedethnicityofeachcasemadeusing and Table 3. Table 2 describes the sensitivity, specificity Onomap and Nam Pehchan, and the ethnic breakdown and PPV of ethnicity derived from the NBSS compared of the area of residence of the case. The number of withthatrecordedinHES(i.e. usingHESasa goldstan- imputeddatasetsgeneratedforthefullrunwassetto23 dard), for 5243 breast cancer cases with ethnic group asethnicitywasmissingfor22.6%ofthecases(Table3). recordedinbothHESandNBSS datasets.Sensitivitywas Table6showsthesensitivity,specificityandPPVofthe high(>90%)forWhiteandSouthAsiancases,andmod- fullmultinomiallogisticregressionmodelusedtoimpute erately high (61.4%) for Black cases, suggesting that the missingethnicity.Thesensitivityandspecificityofthefull NBSScouldbeused to determinethe ethnicityof cancer modelwascomparabletothatfromthenamerecognition registry cases where this was not recorded in HES. No software alone for the White group (99.3%/56.0% vs. cases recorded as Chinese/Other or Mixed ethnicity in 99.8%/51.5%, respectively). The sensitivity of the full HES were assigned the same ethnic group in the NBSS modelwasslightlyhigherforcasesfromtheSouthAsian dataset.However,thevalueoftheNBSSrecordsforthese group than name recognition software alone (94.7% vs. twoethnicgroupscannotbepreciselydeterminedbecause 90.5%,respectively),andsubstantiallyhigherforBlackand of the small numbers involved (14 and 10 cases, respec- Chinese/Otherethnicgroups(20.4%vs.2.3%and21%vs. tively).Table3showstheeffectofusingtheNBSSdatato 0%, respectively).The sensitivityofthefullmodel forthe resolve the ethnicity of registry cases that were not Mixedethnicgroupremainedat0%. recordedinHES.Atotalof1082/26342(4.1%)breastcan- Table7comparestheproportionofcasesineachethnic cer cases whose ethnicity was not known in HES had an groupforcompleteandimputedcases(all23imputations ethnic group recorded in the NBSS. Overall it decreased combined). The proportion of cases in the White, South the proportion of cancer cases with unknown ethnicity Asian and Black groups was slightly lower among the from23.6%to22.6%. imputed cases than the complete cases (95.8% vs. 96%, The sensitivity, specificity and PPV of Onomap and 1.7% vs. 1.8%, and 1.6% vs. 1.7%, respectively). For the NamPehchanforeachethnicgroupisshowninTable4. remaining ethnic groups, the proportion of cases each Table 2Sensitivity, Specificity and Positive Predictive Value ofNBSS-derived Ethnicity forBreast Cancer Cases EthnicgrouprecordedinHES NumberofcasesrecordedinHES NBSS Sensitivity Specificity Positivepredictivevalue White 5093 99.7% 77.3% 99.3% SouthAsian 82 90.0% 99.8% 87.1% Black 44 61.4% 99.9% 79.4% Chinese/Other 14 0.0% 100.0% Mixed 10 0.0% 100.0% Includes5243caseswhereethnicgroupwasrecordedinbothHESandNBSSdatasets.Individuallogisticmodels(positiveoutcomethreshold:p>=0.5). Ryanetal.BMCMedicalInformaticsandDecisionMaking2012,12:3 Page6of8 http://www.biomedcentral.com/1472-6947/12/3 Table 3Ethnicity ofCases FollowingLinkage withHES andNBSSDatasets Ethnicgroup Caseswithethnicgroup CaseswithunknownethnicityresolvedbyNBSSlinkage Ethnicbreakdownofcohort recordedinHES(%) followinguseofHESandNBSS(%) White 81934 (73.4) 1053 82987 (74.3) SouthAsian 1545 (1.4) 13 1558 (1.4) Black 1429 (1.3) 11 1440 (1.3) Chinese/Other 303 (0.3) 3 306 (0.3) Mixed 141 (0.1) 2 143 (0.1) Notknown 26342 (23.6) 26331 (22.6) Total 111694 1082 group was approximately 1.5 times higher among the ethnicityratherthanincludingitasapredictorofethni- imputed cases (0.6% vs. 0.4%, and 0.3% vs. 0.2% for Chi- city in the MIprocess: it didrefer directly to the person nese/OtherandMixedethnicgroups,respectively.) of interest. Similar datasets were not available for the othercancersitesofinterest. Discussion TheperformanceofNamPehchaniswidelyknownbut, The main aim of this project was to create a method to asfarasweareaware,thisisthefirstpeerreviewedreport imputetheethnicgroupofcancercaseswhowerenotified on the Onomap application. The higher sensitivity and to the regional cancer registry, but whose ethnic group specificity that was achieved by using both applications wasnotavailablefromtheirmainsource,linkagewiththe togethersuggeststhatthebestname-basedpredictionsof nationaldatabaseonhospitaladmissions(HES).Wemade ethnicity can be achieved by the use of multiple applica- use of precise external information on the ethnicity of tions.It is, however,unlikely that namerecognition soft- caseswherepossible,throughlinkagewithafurtherdata- ware will ever precisely predict membership of some set (the NBSS), two name recognition applications, and ethnic groups: although many people from South Asian, area-basedinformationontheethnicmake-upoftheresi- Chineseandsomeother ethnicgroupsmayhavedistinc- dent population. We then assessed the value of each of tive names, many individuals from White, Black and theseadditionalsourcesbycomparingthemwiththeeth- Mixed ethnic groups do not. This suggests that we will nicgroupofcaseswhoseethnicitywasknownfromHES. alwayshavetomakesomeallowancefortheirimprecision, In the final stage of the method, we created a dataset andincludeadditionalpredictorsofethnicity. whichcanbeusedtoestimateethnicgroupspecificcancer Area-based information from the national Census is a incidenceandsurvival:thisinvolvedtheuseofaMIpro- popular predictor of ethnicity and easy to implement, cedure(ICE). but this project demonstrates that is not precise enough The main benefit of using additional linked datasets, to be used alone. Although sensitivity for the White eth- liketheNBSS,isthatitmakesuseofpreciseinformation nic group may be high (> 99%) specificity is very low recordedabouttheindividualsofinterest.Themainlim- (21%), showing that it misclassifies approximately 4 out itationforthisprojectisthattheNBSSdatasetonlycon- of every 5 people from non-White backgrounds as tains breast cancer cases who attended the screening White. Similarly, sensitivity is poor for the most com- programme andwho had theirethnic grouprecordedat mon non-White ethnic groups in the region: the use of thattime:thisresolvedtheethnicgroupofjust1%ofthe area-based Census data only appears to correctly iden- cancer cases whose ethnicity was not already known tify approximately 7% of South Asian and 2% of Black from linkage with HES. We decided to use NBSS cases. Setting alternative cutoff values for the model pre- recorded ethnicity as a direct substitute for missing dictions from the default value of 0.5 did not improve Table 4Sensitivity, Specificity and Positive Predictive Value ofNameRecognition Software Namerecognitionsoftware Ethnicgroup Sensitivity Specificity Positivepredictivevalue Onomap White 99.8% 51.5% 98.0% SouthAsian 82.1% 99.9% 92.9% Black 4.4% 99.9% 70.8% Chinese/Other 0.0% 100.0% NamPehchan* SouthAsian 71.1% 99.9% 94.5% OnomapandNamPehchancombined SouthAsian 90.5% 99.9% 93.3% Includes85352caseswhereasingleethnicgroupwasrecordedintheHESdataset.Individuallogisticmodels(positiveoutcomethreshold:p>=0.5). *Matchedonforenameandsurnameseparately. Ryanetal.BMCMedicalInformaticsandDecisionMaking2012,12:3 Page7of8 http://www.biomedcentral.com/1472-6947/12/3 Table 5Sensitivity, Specificity and Positive Predictive Table 7Comparison ofDistribution ofEthnic Groups: Value ofCensus Data onEthnicity Observedand Imputed Ethnicgroup Sensitivity Specificity Positivepredictivevalue Ethnicgroup Observed% Imputed*% Total*% White 99.3% 21.4% 96.8% White 96.0 95.8 96.0 SouthAsian 7.4% 99.8% 44.9% SouthAsian 1.8 1.7 1.8 Black 2.3% 99.9% 34.4% Black 1.7 1.6 1.7 Chinese/Other 0.0% 100.0% Chinese/Other 0.4 0.6 0.4 Mixed 0.0% 100.0% Mixed 0.2 0.3 0.2 Includes85352caseswhereasingleethnicgroupwasrecordedintheHES Total 100 100 100 dataset.Censusdatausedtopredictethnicgroupwere:percentageoflocal *Usingall23imputationscombined. populationinSouthAsian,Black,Chinese/OtherandMixedethnicgroupat lastnationalcensus.Individuallogisticmodels(positiveoutcomethreshold:p >=0.5). identifymembershipofthesegroupsmoreprecisely,using age-specificdata ontheethniccompositionofsmallgeo- the predictive performance of the Census data to any graphic areas (LSOAs) if this is published following the great extent (results not shown). nextnationalCensusorcountryofbirth,forexample. The full model used to predict ethnicity within the MI procedure did appear to be superior to naming software Conclusions and Census data alone. The model appeared to perform In summary, we have developed a method and dataset best for the South Asian ethnic group, and did identify that will allow comparison of cancer incidence and sur- membershipoftheWhite,BlackandChinese/Othereth- vival between ethnic groups. However, the sensitivity of nic groups with greater sensitivity than any of its indivi- the Onomap name recognition application appears to be dualconstituentpredictors.However,thesensitivityofthe low for people from non-White and non-South Asian fullmodelinabsolutetermsforBlack,Chinese/Otherand ethnic groups, suggesting that it is of limited use for Mixed groups is low. It is, therefore, uncertain that the studies that wish to classify individuals by ethnic group. existingpredictors can be improved,or that new predic- Similarly, area-based information from the national Cen- torscouldbeadded, whichwouldincreasethesensitivity sus, a common approach where individual names are of future models to levels similar to that seen for the not available but area of residence is, appears to be SouthAsiangroup. imprecise, particularly for the less common ethnic The greatest difference between the observed and groups. Currently, neither method for assigning indivi- imputed data in the final model was for two ofthe three duals to an ethnic group performs well across all ethnic minority groups whose ethnicity is poorly predicted by groups. We recommend further development of name name recognition software (Chinese/Other and Mixed). recognition applications and the identification of addi- This may indicate that the imputation process does not tional methods for predicting ethnicity to improve their performwellinthesecases.New MI modelsmay benefit precision and accuracy for comparisons of health out- from including other predictors of ethnicity which help comes. Nevertheless, neither imputation nor name recognition software will be completely accurate: reliable statistics relating to the incidence, prevalence and survi- Table 6Sensitivity, Specificity and Positive Predictive val of persons with cancer by ethnic group require more Value ofFull Model complete recording of these data [19]. Ethnicgroup Sensitivity Specificity Positivepredictivevalue White 99.7% 56.0% 98.2% Acknowledgements SouthAsian 94.7% 99.8% 90.4% ThisprojectwassupportedbytheWestMidlandsCancerIntelligenceUnit Black 20.4% 99.8% 63.6% (WMCIU).ItwascommissionedbytheWMCIUwhichprovidedaccesstothe Chinese/Other 21.0% 99.9% 57.6% dataandfundingforRRtocarryouttheproject.GListheDirectorofthe WMCIUandSVwastheDeputyDirectorofCancerRegistrationatthe Mixed 0% 100% WMCIUatthetimeofthestudy:theycontributedtothewritingofthe Amultinomiallogisticregressionmodelwasusedtopredictethnicgroup.The manuscriptandtothedecisiontosubmitthemanuscriptforpublication.RR modelwasdevelopedonarandomlyselected50%sampleofthe85352cases andSWareemployedbytheUniversityofBirmingham. whoseethnicitywasrecordedintheHESdataset.Theremaiming50%of caseswereusedtovalidatethemodelandderivetheaboveestimates.The Authordetails predictorsusedinthemodelwere:ethnicityderivedfromnamerecognition 1PublicHealth,EpidemiologyandBiostatistics,UniversityofBirmingham, software;Censusestimatesofethnicdistributionofpopulation;numberof BirminghamB152TTUK.2EasternCancerRegistrationandInformation hospitaladmissions;yearofdiagnosis;patientseenoutsidetheNHS(yes/no); Centre,PublicHealthBuilding,UnitC-MagogCourt,ShelfordBottom, screen-detectedcancer(yes/no);deathcertificateonlycancerregistration HintonWay,Cambridge,CB223ADUK.3WestMidlandsCancerIntelligence (yes/no);cancertreatmenttype(surgery/radiotherapy/chemotherapy); Unit,PublicHealthBuilding,UniversityofBirmingham,BirminghamB152TT deprivationscore;gender;ageatdiagnosis;cancersite;anddeathduring UK.4PrimaryCareClinicalSciences,UniversityofBirmingham,Birmingham follow-upperiod(all-causeandduetoprimarycancerseparately)andtimeto B152TTUK. death/censoring(Nelson-Aalencumulativehazard). Ryanetal.BMCMedicalInformaticsandDecisionMaking2012,12:3 Page8of8 http://www.biomedcentral.com/1472-6947/12/3 Authors’contributions 19. IqbalG,GumberA,SzczepuraA,JohnsonM,WilsonS,DunnJ:Improving Allauthorscontributedtothedesignoftheproject.RRledthe ethnicitydatacollectionforcancerstatisticsintheUK.DiversityinHealth developmentofthemethodforimputingmissingdata,performedthe andCare2009,16:267-285. analysis,andwrotethefirstdraftofthemanuscript.Allauthorscontributed tolaterdraftsandapprovedthefinalmanuscript. Pre-publicationhistory Thepre-publicationhistoryforthispapercanbeaccessedhere: Competinginterests http://www.biomedcentral.com/1472-6947/12/3/prepub Theauthorsdeclarethattheyhavenocompetinginterests. doi:10.1186/1472-6947-12-3 Received:18May2011 Accepted:23January2012 Citethisarticleas:Ryanetal.:Useofnamerecognitionsoftware, Published:23January2012 censusdataandmultipleimputationtopredictmissingdataon ethnicity:applicationtocancerregistryrecords.BMCMedicalInformatics andDecisionMaking201212:3. References 1. HospitalEpisodesStatistics.[http://www.hesonline.nhs.uk],accessed01 Sept2010. 2. NationalCancerIntelligenceNetworkCoordinatingCentre:Cancerincidence andsurvivalbymajorethnicgroup,England,2002-2006[http://publications. cancerresearchuk.org/downloads/Product/CSINCSURVBYETHNICITY.pdf], accessed13Jan2012. 3. CumminsC,WinterH,ChengKK,MaricR,SilcocksP,VargheseC:An assessmentoftheNamPehchancomputerprogramforthe identificationofnamesofsouthAsianethnicorigin.JPublicHealthMed 1999,21:401-6. 4. PriceCL,SzczepuraAK,GumberAK,PatnickJP:Comparisonofbreastand bowelcancerscreeninguptakepatternsinacommoncohortofSouth AsianwomeninEngland.BMCHealthServicesResearch2010,10:103. 5. SzczepuraA,PriceCL,GumberA:Breastandbowelcancerscreening uptakepatternsover15yearsforUKSouthAsianethnicminority populations,correctedfordifferencesinsocio-demographic characteristics.BMCPublicHealth2008,8:1471-2458. 6. NanchahalK,MangtaniP,AlstonM,dosSantosSilvaI:Developmentand validationofacomputerizedSouthAsianNamesandGroup RecognitionAlgorithm(SANGRA)foruseinBritishhealth-relatedstudies. JPublicHealthMed2001,23:278-85. 7. OnoMAP.[http://www.onomap.org],accessed5Oct2010. 8. TheNamesProjects.[http://redress.lancs.ac.uk/resources/launch.php? creator=MateosPablo&title=TheNamesProjects],accessed13Jan2012. 9. MateosP:Areviewofname-basedethnicityclassificationmethodsand theirpotentialinpopulationstudies.PopulationSpaceandPlace2007, 13:243-263. 10. BrantLJ,BoxallE:Theproblemwithusingcomputerprogrammesto assignethnicity:Immigrationdecreasessensitivity.PublicHealth2009, 123:316-320. 11. IdentifyingEthnicity:comparisonoftwocomputerprogrammes.[http:// www2.warwick.ac.uk/fac/med/research/csri/ethnicityhealth/ aspects_diversity/identifying_ethnicity/identifying_ethnicity.doc]. 12. RenshawC,JackRH,DixonS,MøllerH,DaviesEA:Estimatingattendance forbreastcancerscreeninginethnicgroupsinLondon.BMCPublic Health2010,10:157. 13. Hippisley-CoxJ,CouplandC,VinogradovaY,RobsonJ,MayM,BrindleP: DerivationandvalidationofQRISK,anewcardiovasculardiseaserisk scorefortheUnitedKingdom:prospectiveopencohortstudy.BMJ2007, 335:136. 14. RoystonP,CarlinJB,WhiteIR:Multipleimputationofmissingvalues:New featuresformim.StataJournal2009,9:252-264. 15. PopulationEstimatesbyEthnicGroup:MethodologyPaper.[http://www. ons.gov.uk/ons/rel/peeg/population-estimates-by-ethnic-group– experimental-/current-estimates/population-estimates-by-ethnic-group- Submit your next manuscript to BioMed Central methodology-paper.pdf],accessed13Jan2012. 16. DowningA,FormanD,ThomasJD,WestRM,LawrenceG,GilthorpeMS: and take full advantage of: Investigatingtheassociationbetweenethnicityandsurvivalfrombreast cancerusingroutinelycollectedhealthdata:challengesandpotential • Convenient online submission solutions[abstract].JournalofEpidemiologyandCommunityHealth2009, 63:88. • Thorough peer review 17. vanBuurenS,BoshuizenHC,KnookDL:Multipleimputationofmissing • No space constraints or color figure charges bloodpressurecovariatesinsurvivalanalysis.StatisticsinMedicine1999, • Immediate publication on acceptance 18:681-694. 18. SterneJA,WhiteIR,CarlinJB,SprattM,RoystonP,KenwardMG,WoodAM, • Inclusion in PubMed, CAS, Scopus and Google Scholar CarpenterJR:Multipleimputationformissingdatainepidemiological • Research which is freely available for redistribution andclinicalresearch:potentialandpitfalls.BMJ2009,338:b2393. Submit your manuscript at www.biomedcentral.com/submit