ebook img

Design and Evaluation of a Data-Driven Password Meter PDF

12 Pages·2017·0.64 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Design and Evaluation of a Data-Driven Password Meter

Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA Design and Evaluation of a Data-Driven Password Meter BlaseUr*,FeliciaAlfieri,MaungAung,LujoBauer,NicolasChristin, JessicaColnago,LorrieFaithCranor,HenryDixon,PardisEmamiNaeini, HanaHabib,NoahJohnson,WilliamMelicher *UniversityofChicago,CarnegieMellonUniversity [email protected] {fla,mza,lbauer,nicolasc,jcolnago,lorrie,hdixon,pardis,hana007,noah,billy}@cmu.edu ABSTRACT thestrengthofapasswordthanotheravailablemetersandpro- Despitetheirubiquity,manypasswordmetersprovideinac- videsmoreuseful,actionablefeedbacktousers.Whereasmost curatestrengthestimates. Furthermore,theydonotexplain previousmetersscoredpasswordsusingverybasicheuristics to users what is wrong with their password or how to im- [10,42,52],weusethecomplementarytechniquesofsimulat- prove it. We describe the development and evaluation of a ingadversarialguessingusingartificialneuralnetworks[32] data-driven password meter that provides accurate strength andemploying21heuristicstoratepasswordstrength. Our measurementandactionable,detailedfeedbacktousers. This meteralsogivesusersactionable,data-drivenfeedbackabout metercombinesneuralnetworksandnumerouscarefullycom- howtoimprovetheirspecificcandidatepassword. Weprovide binedheuristicstoscorepasswordsandgeneratedata-driven userswithuptothreewaysinwhichtheycouldimprovetheir text feedback about the user’s password. We describe the passwordbasedonthecharacteristicsoftheirspecificpass- meter’siterativedevelopmentandfinaldesign. Wedetailthe word.Furthermore,weautomaticallyproposemodificationsto securityandusabilityimpactofthemeter’sdesigndimensions, theuser’spasswordthroughjudiciousinsertions,substitutions, examinedthrougha4,509-participantonlinestudy. Underthe rearrangements,andcasechanges. more common password-composition policy we tested, we Inthispaper,wedescribeourmeterandtheresultsofa4,509- foundthatthedata-drivenmeterwithdetailedfeedbackled participant online study of how different design decisions userstocreatemoresecure,andnolessmemorable,passwords impactedthesecurityandusabilityofpasswordsparticipants thanameterwithonlyabarasastrengthindicator. created. Wetestedtwopassword-compositionpolicies,three scoringstringencies,andsixdifferentlevelsoffeedback,rang- ACMClassificationKeywords ingfromnofeedbackwhatsoevertoourfull-featuredmeter. K.6.5 Security and Protection: Authentication; H.5.2 User Interfaces: Evaluation/methodology Under the more common password-composition policy we tested,wefoundthatourdata-drivenmeterwithdetailedfeed- AuthorKeywords backleduserstocreatemoresecurepasswordsthanameter Passwords;usablesecurity;data-driven;meter;feedback withonlyabarasastrengthindicatorornothavinganyme- ter,withoutasignificantimpactonanyofourmemorability INTRODUCTION metrics. Mostparticipantsreportedthatthetextfeedbackwas Passwordmetersareusedwidelytohelpuserscreatebetter informativeandhelpedthemcreatestrongerpasswords. passwords [42], yet they often provide ratings of password strength that are, at best, only weakly correlated to actual RELATEDWORK passwordstrength[10]. Furthermore,currentmetersprovide Userssometimesmakepredictablepasswords[22,30,48]even minimalfeedbacktousers. Theymaytellauserthathisor forimportantaccounts[13,31]. Manyusersbasepasswords herpasswordis“weak”or“fair”[10,42,52],buttheydonot around words and phrases [5,23,29,45,46]. When pass- explainwhattheuserisdoingwronginmakingapassword, words contain uppercase letters, digits, and symbols, they nordotheyguidetheusertowardsabetterpassword. areofteninpredictablelocations[4]. Keyboardpatternslike Inthispaper,wedescribeourdevelopmentandevaluationof “1qaz2wsx” [46] and dates [47] are common in passwords. anopen-sourcepasswordmeterthatismoreaccurateatrating Passwordssometimescontaincharactersubstitutions,suchas replacing “e” with “3” [26]. Furthermore, users frequently reusepasswords[9,14,25,38,48],givingthecompromiseof evenasingleaccountpotentiallyfar-reachingrepercussions. Permissiontomakedigitalorhardcopiesofpartorallofthisworkforpersonalor In designing our meter, we strove to help users understand classroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributed whentheirpasswordexhibitedthesecommontendencies. forprofitorcommercialadvantageandthatcopiesbearthisnoticeandthefullcitation onthefirstpage.Copyrightsforthird-partycomponentsofthisworkmustbehonored. Threetypesofinterventionsattempttoguideuserstowards Forallotheruses,contacttheOwner/Author. Copyrightisheldbytheowner/author(s). strongpasswords. First,password-compositionpoliciesdic- CHI2017,May06–11,2017,Denver,CO,USA tatecharacteristicsapasswordmustinclude,suchasparticular ACM978-1-4503-4655-9/17/05. http://dx.doi.org/10.1145/3025453.3026050 3775 Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA character classes. While these policies can be effective for sioncomparingthesescorestothepassword’sguessability,as security,usersoftenfindcomplexcompositionpoliciesunus- modeledbyCMU’sPasswordGuessabilityService[7]. This able[2,20,24,28,37,50].Second,proactivepasswordchecking servicemodelsfourtypesofguessingattacksandhasbeen aimstomodelapassword’ssecurityandonlypermitusersto foundtobeaconservativeproxyforanexpertattacker[44]. select a password the model deems sufficiently strong. For Although these carefully combined heuristics also provide instance,researchershaveproposedusingserver-sideMarkov relatively accurate password-strength estimates, at least for modelstogaugepasswordstrength[8]. Thisapproachisinap- resistance to online attacks [52], we use them primarily to propriatewhenthepasswordshouldneverbesenttoaserver identify characteristics of the password that are associated (e.g., encrypting a hard drive). It also requires non-trivial withguessability. Ifacandidatepasswordscoreshighona configurationandcanenableside-channelattacks[35]. particularheuristic,indicatingthepresenceofacommonpat- Third,servicescommonlyprovidepasswordmeterstoshow tern,wegeneratetextfeedbackthatexplainswhatiswrong estimatedpasswordstrength. Toenablethesemeterstorun withthataspectofthepasswordandhowtoimproveit. Wede- client-sideandthusavoidthesecuritypitfallsofaserver-side velopedthewordingsforthisfeedbackiterativelyandthrough solution,mostmetersusebasicheuristics,suchasestimating aformativelabstudy.1 apassword’sstrengthbasedonitslengthandthenumberof character classes used [42]. These sorts of meters can suc- VISUALDESIGNANDUSEREXPERIENCE cessfullyencourageuserstocreatestrongerpasswords[42], In this section, we describe the visual design of our meter. thoughperhapsonlyforhigher-valueaccounts[12]. Unfor- Atahighlevel, themetercomprisesthreedifferentscreens. tunately, these basic heuristics frequently do not reflect the Themainscreenusesthevisualmetaphorofabartodisplay actualstrengthofapassword[1,10]. Amongpriorpassword the strength of the password, and it also provides detailed, metersbasedon heuristics, onlyzxcvbn [51,52]usesmore data-drivenfeedbackabouthowtheusercanimprovehisor advancedheuristics[10,32]. Akeydifferencefromzxcvbn her password. The main screen also contains links to the in the design of both our meter and our experiment is that twootherscreens. Userswhoclick“(Why?)” linksnextto generatingandtestingtheimpactoffeedbacktotheuseris the feedback about their specific password are taken to the primaryforus. specific-feedbackmodal,whichgivesmoredetailedfeedback. Users who click the “How to make strong passwords” link Researchershavetriednumerousvisualdisplaysofpassword or“(Why?)” linksadjacenttofeedbackaboutpasswordreuse strength. Using a bar is most common [42]. However, re- aretakentothegeneric-advicemodal,whichisastaticlistof searchershavestudiedusinglarge-scaletrainingdatatoshow abstractstrategiesforcreatingstrongpasswords. userspredictionsofwhattheywilltypenext[27]. Othershave investigatedapeer-pressuremeterthatcomparesthestrength of a user’s password with those of other users [36]. These TranslatingScorestoaVisualBar alternativevisualizationshaveyettobewidelyadopted. Password-strengthmeasurementsarenormallydisplayedto usersnotasanumericalscore,butusingacoloredbar[10,42]. Increatingourmeter,weneededtomapapassword’sscores MEASURINGPASSWORDSTRENGTHINOURMETER fromboththeneuralnetworkandcombinedheuristicstothe We move beyond measuring password strength using inac- amount of the bar that should be filled. We conservatively curatebasicheuristicsbycombiningtwocomplementaryap- calculated log of the lower of the two estimates for the 10 proaches: modeling a principled password-guessing attack numberofguessesthepasswordwouldwithstand. usingneuralnetworksandcarefullycombining21heuristics. Prior work has shown that most users consider a password Ourfirstapproachreliesonrecentworkthatproposedneural sufficientlystrongifonlypartofthebarisfull[42]. Therefore, networksformodelingapassword-guessingattack[32]. This we mapped scores to the bar such that one-third of the bar approachusesarecurrentneuralnetworktoassignprobabil- beingfilledroughlyequatestoacandidatepasswordresisting ities to future characters in a candidate guess based on the an online attack, while two-thirds means that the candidate previouscharacters. Initstrainingphase,thenetworklearns passwordwouldlikelyresistanofflineattack,assumingthata varioushigher-orderpasswordfeatures. UsingMonteCarlo hashfunctiondesignedforpasswordstorageisused.Wetested methods[11],suchanapproachcanmodel1030 ormoread- threedifferentprecisemappings,whichwetermstringencies. versarialguessesinrealtimeentirelyontheclientsideafter transferringlessthanamegabyteofdatatotheclient[32]. MainScreen However,neuralnetworksareeffectivelyablackboxandpro- On the main screen a bar below the password field fills up videnoexplanationtotheuserfortheirscoring. Inspiredby and changes color to indicate increasing password strength. thezxcvbnpasswordmeter[52],weimplementedandcare- Differentfrompreviousmeters,ourmeteralsodisplaysdata- fully combined 21 heuristics for password scoring. These driventextfeedbackaboutwhataspectsoftheuser’sspecific heuristicssearchforcharacteristicssuchastheinclusionof passwordcouldbeimproved. common words and phrases, the use of common character The meter initially displays text listing the requirements a substitutions, the placement of digits and uppercase letters passwordmustmeet.Oncetheuserbeginstoenterapassword, in common locations, and the inclusion of keyboard pat- terns. We scored 30,000 passwords from well-studied data 1SeeCh.7ofUr’sdissertation[40]fordetailsonthedevelopmentof breaches [18,45] by these heuristics, and we ran a regres- thiswording,theresultsofthelabstudy,andtheheuristics. 3776 Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA Figure2: Asuggestedimprovementfor“Mypassword123,” with changes in magenta. The suggested improvement and sensitivefeedbackappearwhenusersshowtheirpassword. Figure1:Thetool’smainscreenwhenthepasswordishidden. correlated most strongly with password guessability in our regressioncombiningtheheuristics. thetoolindicatesthataparticularrequirementhasbeenmet by coloring that requirement’s text green and displaying a SuggestedImprovement checkmark. Itdenotesunmetrequirementsbycoloringthose To make it easier for users to create a strong password, we requirementsredanddisplayingemptycheckboxes. Untilthe augmentedourtextfeedbackwithasuggestedimprovement, passwordmeetsrequirements,thebarisgray. orconcretemodificationofthecandidatepassword. Asshown inFigure2,asuggestedimprovementfor“Mypassword123” ColoredBar mightbe“My123passwoRzd.” Oncethepasswordmeetscompositionrequirements,wedis- playthebarincolor. Withincreasingpassword-strengthrat- We generate this suggested improvement as follows. First, ings,thebarprogressesfromredtoorangetoyellowtogreen. wetaketheuser’scandidatepasswordandmakeoneofthe Whenitisone-thirdfull,thebarisdarkorange. Attwo-thirds following modifications: toggle the case of a random letter full,itisyellow,soontobegreen. (lowercasetouppercase,orviceversa),insertarandomchar- acterinarandomposition,orreplaceacharacteratarandom TextFeedback positionwitharandomlychosencharacter. Inaddition,ifall Whereasacoloredbaristypicalofpasswordmeters[42],our of the password’s digits or symbols are in a common loca- meterisamongthefirsttoprovidedetailed,data-drivenfeed- tion(e.g.,attheend),wemovethemasagrouptoarandom backonhowtheusercanimprovehisorherspecificcandidate location within the password. We choose from among this password. Wedesignedthetextfeedbacktobedirectlyaction- largesetofmodifications,ratherthanjustmakingmodifica- able,inadditiontobeingspecifictothepassword. Examples tions corresponding to the specific text feedback the meter ofthisfeedbackinclude,asappropriate,suggestionstoavoid displays, togreatlyincreasethespaceofpossiblemodifica- dictionarywordsandkeyboardpatterns,tomoveuppercase tions. Wethenverifythatthismodificationstillcomplieswith lettersawayfromthefrontofthepasswordanddigitsaway thecompositionrequirements. fromtheend,andtoincludedigitsandsymbols. Werequirethatsuggestedimprovementswouldfillatleasttwo- Mostfeedbackcommentsonspecificpartsofthepassword. thirdsofthebarandalsobeatleast1.5ordersofmagnitude Becauseuserslikelydonotexpecttheirpasswordtobeshown hardertoguessthantheoriginalcandidatepassword. When on screen, we designed public and sensitive variants for all we have generated such a suggested improvement and the feedback. Publicversionsmentiononlythegeneralclassof useriscurrentlyshowinghisorherpassword,wedisplaythe characteristic(e.g.,“avoidusingkeyboardpatterns”),whereas suggestedmodificationonscreenwithchangesinmagenta,as sensitiveversionsalsodisplaytheproblematicportionofthe inFigure2. Iftheuserisnotshowinghisorherpassword,we password (e.g., “avoid using keyboard patterns like adgjl”). insteadshowabluebuttonstating“seeyourpasswordwith Wedisplaythepublicversionsoffeedback(Figure1)when ourimprovements.” Priorworkhasinvestigatedautomatically usersarenotshowingtheirpasswordonscreen,whichisthe modifyingpasswordstoincreasesecurity,findinguserswould defaultbehavior. Weprovidecheckboxeswithwhichauser makeweakerpre-improvementpasswordstocompensate[16]. can“showpassword&detailedfeedback,”atwhichpointwe Incontrast,oursuggestedimprovementsareoptional. usethesensitiveversion(Figure2). Toavoidoverwhelmingtheuser,weshowatmostthreepieces Specific-FeedbackModal of feedback at a time. Each of our 21 heuristic functions For the main screen, we designed the feedback specific to returnseitheronesentenceoffeedbackortheemptystring. a user’s password to be both succinct and action-oriented. Tochoosewhichfeedbacktodisplay, wemanuallyordered Althoughtherationaleforthesespecificsuggestionsmightbe thefunctionstoprioritizethosethatwefoundinaformative obvioustomanyusers,weexpecteditwouldnotbeobvious laboratorystudyprovidedthemostusefulinformationandthat toallusersbasedonpriorworkonpasswordperceptions[41, 3777 Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA tacks [19,39], our second point recommends using at least 12charactersinthepassword. Tohelpinspireuserswhoare unfamiliarwithhowtomakea12+characterpassword,we use the third point to propose a way of doing so based on Schneier’s method to use mnemonics or fragments from a uniquesentenceasthebasisforapassword[33]. Thefourth pointrecommendsagainstcommonpasswordcharacteristics. METHODOLOGY We recruited participants from Amazon’s Mechanical Turk crowdsourcingserviceforastudyonpasswords. Werequired thatparticipantsbeage18+andbelocatedintheUnitedStates. Inaddition,becausewehadonlyverifiedthatthemeterworked correctlyonFirefox,Chrome/Chromium,Safari,andOpera, werequiredtheyuseoneofthosebrowsers. In order to measure both password creation and password recall, the study comprised two parts. The first part of the study,whichwetermPart1,includedapasswordcreationtask, Figure3: Thespecific-feedbackmodal(passwordshown). asurvey,andapasswordrecalltask. Weassignedparticipants round-robintoaconditionspecifyingthemetervariantthey wouldseewhencreatingapassword. Thesecondpartofthe 43]. Togivedetailedexplanationsofwhyspecificsuggestions study, which we term Part 2, took place at least 48 hours would improve a password, we created a specific-feedback afterthefirstpartofthestudyandincludedapasswordrecall modal,showninFigure3. Whenauserclicks“(Why?)” next task and a survey. We compensated participants $0.55 for topassword-specificfeedback,themodalappears. completingPart1and$0.70forPart2. Thespecific-feedbackmodal’smainbulletpointsmirrorthose Afterstudying18conditionsinourfirstexperiment,wehad fromthemainscreen. Beloweachmainpoint,however,the lingeringquestions.Wethereforeranasecondexperimentthat specific-feedbackmodalalsoexplainswhythisactionwould added8newconditionsandrepeated4existingconditions. improve the password. Our explanations take two primary forms. Thefirstformisexplanationsofhowattackerscould exploit particular characteristics of the candidate password. Part1 Followingtheconsentprocess,wetoldparticipantsthatthey Forinstance,ifapasswordcontainssimpletransformations wouldbecreatingapassword.Weaskedthattheyroleplayand of dictionary words, we explain that attackers often guess imaginethattheywerecreatingthispasswordfor“anaccount passwordsbytryingsuchtransformations. Thesecondform theycarealotabout,suchastheirprimaryemailaccount.” We ofexplanationisstatisticsabouthowcommondifferentchar- informedparticipantstheywouldbeinvitedbackinafewdays acteristicsare(e.g.,30%ofpasswordsthatcontainacapital torecallthepasswordandaskedthemto“takethestepsyou letterhaveonlythefirstcharacterofthepasswordcapitalized). wouldnormallytaketocreateandrememberyourimportant passwords,andprotectthispasswordasyounormallywould Generic-AdviceModal protectyourimportantpasswords.” Itisnotalwayspossibletogeneratedata-drivenfeedback. For instance, until a user has typed more than a few characters Theparticipantthencreatedausernameandapassword.While intothepasswordfield,thestrengthofthecandidatepassword doingso,heorshesawthepassword-strengthmeter(orlack cannotyetbedefinitivelydetermined. Inaddition,extremely thereof)dictatedbyhisorherassignedcondition,described predictablecandidatepasswords(e.g.,“password”or“mon- below. Participantsthenansweredasurveyabouthowthey key1”)requirethatuserscompletelyrethinktheirstrategy. We createdthatpassword. Wefirstaskedparticipantstorespond thuscreatedageneric-advicemodalthatrecommendsabstract ona5-pointscale(“stronglydisagree,”“disagree,”“neutral,” strategies for creating passwords. Users access this modal “agree,”“stronglyagree”)tostatementsaboutwhethercreating byclicking“howtomakestrongpasswords”orbyclicking a password was “annoying,” “fun,” or “difficult.” We also “(Why?)” next to suggestions against password reuse. The askedwhethertheyreusedapreviouspassword,modifieda firstoffourpointswemakeonthegeneric-advicemodalad- previouspassword,orcreatedanentirelynewpassword. visesagainstreusingapasswordacrossaccounts. Wechose The next three parts of the survey asked about the meter’s to make this the first point because password reuse is very coloredbar,textfeedback,andsuggestedimprovements. At common[9,14,15,38],yetamajorthreattosecurity. thetopofeachpage,weshowedatextexplanationandvisual Werecommendconstructive,action-orientedstrategiesasour exampleofthefeatureinquestion. Participantsinconditions secondandthirdpoints. Reflectingresearchthatshowedthat thatlackedoneormoreofthesefeatureswerenotaskedques- passwords balancing length and character complexity were tionsaboutthatfeature. Forthefirsttwofeatures,weasked oftenstrong[34]andbecauseattackerscanoftenexhaustively participantstorateonafive-pointscalewhetherthatfeature guess all passwords up to at least 9 characters in offline at- “helpedmecreateastrongpassword,”“wasnotinformative,” 3778 Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA andcausedthemtocreate“adifferentpasswordthan[they] • BarOnly(Bar)showsacoloredbardisplayingpassword wouldhaveotherwise.” Wealsoaskedabouttheimportance strength,butwedonotprovideanytypeoftextfeedback participantsplaceonthemetergivingtheirpasswordahigh otherthanwhichcompositionrequirementshavebeenmet. score,theirperceptionoftheaccuracyofthestrengthrating, • NoFeedback(None)onlyindicatescompliancewiththe andwhethertheylearnedanythingnewfromthefeedback. password-compositionrequirements. Aftertheparticipantcompletedthesurvey,webroughthimor Dimension3: ScoringStringency hertoaloginpageandauto-filledhisorherusername. The Uretal.foundthestringencyofameter’sscoringhasasig- participantthenattemptedtore-enterhisorherpassword. We nificantimpactonpasswordstrength[42]. Wethustestedtwo refer to this final step as Part 1 recall. After five incorrect scoringstringencies. Thesestringencieschangedthemapping attempts,welettheparticipantproceed. betweentheestimatednumberofguessesthepasswordwould resistandhowmuchofthecoloredbarwasfilled. Part2 After48hours,weautomaticallyemailedparticipantstore- • Medium(M)One-thirdofthebarfullrepresents106esti- turn and re-enter their password. We term this step Part 2 matedguessesandtwo-thirdsfullrepresents1012. recall,anditwasidenticaltoPart1recall. Wethendirected • High(H)One-thirdofthebarfullrepresents108estimated participantstoasurveyabouthowtheytriedtoremembertheir guessesandtwo-thirdsfullrepresents1016. password. Inparticular,wefirstaskedhowtheyenteredtheir AdditionalConditionsforExperiment2 password on the previous screen. We gave multiple-choice Experiment2addedthefollowingtwosettingsforourfeed- optionsencompassingautomaticentrybyapasswordmanager backandstringencydimensions,respectively: orbrowser,typingthepasswordinentirelyfrommemory,and lookingapasswordupeitheronpaperorelectronically. • Feedback:Standard,NoBar(NoBar)TheStandardmeter withoutanycoloredbar. Thetextfeedbackstilldependson Conditions thepassword’sscore,sostringencystillmatters. InExperiment1,weassignedparticipantsround-robintoone • Stringency: Low (L) One-third of the bar full represents of 18 different conditions that differed across three dimen- 104estimatedguessesandtwo-thirdsfullrepresents108. sions in a full-factorial design. We refer to our conditions usingthree-partnamesreflectingthedimensions:1)password- To investigate these two settings we introduced eight new compositionpolicy;2)typeoffeedback;and3)stringency. conditionsandre-ranthefourexistingStd-MandStd-Hcon- ditionsinafull-factorialdesign. Dimension1: Password-CompositionPolicy Weexpectedametertohaveadifferentimpactonpassword Analysis securityandusabilitydependingonwhetheritwasusedwith Wecollectednumeroustypesofdata.Ourmainsecuritymetric a minimal or a more complex password-composition pol- wastheguessabilityofeachparticipant’spassword,ascalcu- icy. We tested the following two policies, which represent lated by CMU’s Password Guessability Service [7], which awidespread,laxpolicyandamorecomplexpolicy. modelsfourtypesofguessingattacksandwhichtheyfound to be a conservative proxy for an expert attacker [44]. Our • 1class8 (1c8) requires that passwords contain 8 or more usability measurements encompassed both quantitative and characters,andalsothattheynotbe(case-sensitive)oneof qualitative data. We recorded participants’ keystrokes, en- the96,480suchpasswordsthatappearedfourormoretimes ablingustoanalyzemetricslikepasswordcreationtime. To intheXatocorpusof10millionpasswords[6]. understandtheuseofdifferentfeatures,weinstrumentedallel- • 3class12(3c12)requiresthatpasswordscontain12ormore ementsoftheuserinterfacetorecordwhentheywereclicked. charactersfromatleast3differentcharacterclasses(lower- ForbothPart1andPart2,wemeasuredwhetherparticipants caseletters,uppercaseletters,digits,andsymbols). Italso successfully recalled their password. For participants who requiresthatthepasswordnotbe(case-sensitive)oneofthe didsuccessfullyrecalltheirpassword,wealsomeasuredhow 96,926suchpasswordsthatappearedintheXatocorpus[6]. long it took them, as well as how many attempts were re- Dimension2: TypeofFeedback quired. Because notallparticipantsreturnedfor Part2, we Ourseconddimensionvariesthetypeoffeedbackweprovide measuredwhatproportionofparticipantsdid,hypothesizing toparticipantsabouttheirpassword. Whilethefirstsetting thatparticipantswhodidnotremembertheirpasswordmight representsourstandardmeter,weremovedfeaturesforeach be less likely to return. To only study attempts at recalling oftheothersettingstotesttheimpactofthosefeatures. apasswordfrommemory,weanalyzedpasswordrecallonly forparticipantswhosaidtheytypedtheirpasswordinentirely • Standard(Std)includesallpreviouslydescribedfeatures. frommemory,saidtheydidnotreusetheirstudypassword, • NoSuggestedImprovement(StdNS)isthesameasStan- andwhosekeystrokesdidnotshowevidenceofcopy-pasting. dard,exceptitneverdisplaysasuggestedimprovement. Weaugmentedourobjectivemeasurementswithanalysesof • Public(Pub)isthesameasStandard,exceptwenevershow responsestomultiple-choicequestionsandqualitativeanalysis sensitive text feedback (i.e., we never show a suggested of free-text responses. These optional free-text responses improvementandalwaysshowthelessinformative“public” solicitedparticipants’thoughtsabouttheinterfaceelements, feedbacknormallyshownwhenthepasswordishidden). aswellaswhytheydid(ordidnot)findthemuseful. 3779 Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA Ourprimarygoalwasunderstandingthequantitativeeffects frequently,whileothersareenteredveryrarely,andtheme- of varying the three meter-design dimensions. Because we ter’simpactonmemorability(orlackthereof)maydifferin hadmultipleindependentvariables,eachreflectingonedesign thesescenarios. Furthermore,whilewestudiedourmeter’s dimension,weperformedregressions. Weranalinearregres- impactonthecreationofasinglepassword,userstypically sionforcontinuousdata(e.g.,thetimetocreateapassword), havealargenumberofpasswords,potentiallyimpactingmulti- alogisticregressionforbinarydata(e.g.,whetherornotthey accountintereferenceeffects. Ifafuturestudyweretofind clickedonagivenUIelement),anordinalregressionforor- thatourmeterincreasesmulti-accountinterference,themeter dinal data (e.g., Likert-scale responses), and a multinomial mightbebestdeployedonlyforhigh-valueaccounts. logisticregressionforcategoricaldatawithnoclearordering Ourstudy’secologicalvalidityisalsolimitedinotheraspects. (e.g.,howtheyenteredtheirpassword). Wedidnottesthabituationeffects,eithertotheuseofapar- Foroursecurityanalyses,weperformedaCoxProportional- ticularpasswordortothenovelpassword-meterfeatureswe Hazards Regression, which is borrowed from the literature tested. Ourpasswordmetermightcauselong-termchangesin onsurvivalanalysisandhasbeenusedtocomparepassword howuserscreatepasswordsinthewildthatonlyanextended guessability[31].Becauseweknowthestartingpointofguess- field study could capture. On the one hand, lessons users ingbutnottheendpoint,weusearight-censoredmodel[17]. learnfromourmeteraboutpasswordcreationcouldhelpthem In a traditional clinical model using survival analysis, each createstrongerpasswordsforsitesthatdonotprovidefeed- datapointismarkedas“alive”or“deceased,”alongwiththe back. Ontheotherhand,potentialusabilitydrawbacksmight timeoftheobservation. Ouranaloguesforpasswordsare“not become more pronounced or have a larger effect over time guessed”and“guessed,”alongwiththenumberofguessesat inthewild. Inaddition,passwordcreationwastheprimary whichthepasswordwasguessed,ortheguessingcutoff. taskinourstudy. Ourmetermighthaveadifferentimpactin themoretypicalscenarioofpasswordcreationbeingusers’ Wealwaysfirstfitamodelwiththethreedesigndimensions secondarytask,whichmightlowertheirmotivationtocreate (compositionpolicy,feedback,andstringency)eachtreatedas strongpasswords. ordinalvariablesfitlinearly,aswellasinteractiontermsfor eachpairofdimensions. Tobuildaparsimoniousmodel,we PARTICIPANTS removedanyinteractiontermsthatwerenotsignificant,yet 4,509 participants (2,717 in Experiment 1 and 1,792 in Ex- alwayskeptallthreemaineffects,andre-ranthemodel. periment 2) completed Part 1, and 84.1% of them finished We corrected for multiple testing using the Benjamini- Part2.Amongourparticipants,52%identifiedasfemale,47% Hochberg(BH)procedure[3]. Wechosethisapproach,which asmale,andtheremaining1%identifiedasanothergender ismorepowerfulandlessconservativethanmethodslikeHolm or preferred not to answer. Participants’ ages ranged from Correction,becauseweperformedanexploratorystudywitha 18 to 80 years old, with a median of 32 (mean 34.7). We largenumberofvariables. Wecorrectedallp-valuesforeach askedwhetherparticipantsare“majoringinor...haveadegree experimentasagroup. Weuseα =0.05. orjobincomputerscience,computerengineering,informa- tiontechnology,orarelatedfield,”and82%responded“no.” NotethatweanalyzedExperiments1and2separately. Thus, Demographicsdidnotvarysignificantlybycondition. we do not compare conditions between experiments. How- ever,inthegraphsandtablesthatfollowwehavecombined RESULTS ourreportingofthesetwoexperimentsforbrevity. Weonly Increasinglevelsofdata-drivenfeedback, evenbeyondjust report“NoBar”andlow-stringencydatafromExperiment2in acoloredbar,leduserstocreatestronger1class8passwords. thesetablesandgraphs. Forconditionsthatwerepartofboth Thatis,detailedtextfeedbackledtomoresecurepasswords experiments,weonlyreporttheresultsfromExperiment1. than the colored bar alone. The 3class12 policy also led to strongerpasswords.Incontrast,varyingthescoringstringency hadonlyasmallimpact. Limitations We based our study’s design on one researchers have used Amongourusabilitymetrics,alldimensionsaffectedtiming previouslytoinvestigatevariousaspectsofpasswords[21,28, andsentiment,buttheymostlydidnotaffectpasswordmem- 34,42]. However,thepasswordparticipantscreateddidnot orability. Notably,althoughincreasinglevelsofdata-driven protect anything of value. Beyond our request that they do feedbackledtostrongerpasswords,wedidnotobserveany so,participantsdidnotneedtoexhibittheirnormalbehavior. significantimpactonmemorability. Table1summarizesour Mazureketal.[31]andFahletal.[13]examinedtheecological keyfindings. Throughoutthissection,ifwedonotexplicitly validityofthisprotocol,findingittobeareasonable,albeit calloutametricasrevealingadifferencebetweenvariantsof imperfect,proxyforhigh-valuepasswordsforrealaccounts. themeter,thenwedidnotobserveasignificantdifference. Thatsaid,nocontrolledexperimentcancaptureeveryaspect Overall,56.5%ofparticipantssaidtheytypedtheirpassword ofpasswordcreation. Wedidnotcontrolthedevice[49,53] inentirelyfrommemory. Otherparticipantslookeditupon onwhichparticipantscreatedapassword,norcouldwecon- paper (14.6%) or on an electronic device (12.8%), such as trolhowparticipantschosetoremembertheirpassword. We theircomputerorphone. Anadditional11.2%ofparticipants testedpasswordrecallatonlytwopointsintime. Ourlimited said their password was entered automatically for them by evaluationofmemorabilitydoesnotcapturemanyaspectsof a password manager or browser, while 4.9% entered their thepasswordecosystem. Somepasswordsareenteredvery passwordinanotherway(e.g.,lookinguphints). 3780 Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA 60% 1c8−None password’ssurvivaltermasthedependentvariable. ForEx- periment 1, we found significant main effects for both the 1c8−Bar−M policyandtypeoffeedback,butwealsoobservedasignificant interactioneffectbetweenthepolicyandtypeoffeedback.For d increasedintelligibility,wesubsequentlyranseparateregres- e40% ss 1c8−Std−M sions for 1class8 and 3class12 passwords. As the 3class12 e gu policyrequireslongerpasswordsthan1class8,participantsun- nt 3c12−None surprisinglycreated3class12passwordsthatweresignificantly Perce20% 33cc1122−−BSatdr−−MM longer (p<.001) and significantly more secure (p<.001) than1class8passwords. Moving from a 1class8 to a 3class12 policy increased the timeittooktocreatethepassword,measuredfromthefirst 0% keystroketothelastkeystrokeinthepasswordbox(p=.014). 101 103 105 107 109 1011 1013 1015 Italsoimpactedparticipantsentiment. The3class12policy Guesses led participants to report password creation as significantly Figure4: Guessabilityofmedium-stringencypasswordscre- moreannoyinganddifficult(both p<.001). atedwithoutanyfeedback(“None”),withonlyacoloredbar Passwords created under a 3class12 policy were more se- (“Bar”),andwithbothacoloredbarandtextfeedback(“Std”). cure than those created under 1class8, but 3class12 partici- pants were less likely to remember their passwords during Part2(p=.025). Acrossconditions,81.3%of1class8partic- 60% 1c8−None ipantsrecalledtheirpasswordduringPart2,whereas75.0% of3class12participantsdid. Thepolicydidnotsignificantly 1c8−Bar−M impactanyotherrecallmetrics. 1c8−Pub−M d 1c8−StdNS−M se40% 1c8−NoBar−M es 1c8−Std−M gu ImpactoftheAmountofFeedback nt For1class8passwords,wefoundthatincreasinglevelsofdata- e c Per20% dpraisvsewnofredesd(bpac<k.l0ed01p)a.rRticeilpaatinvtesttoohcarevaintegsnigonfiefiecdabnatclyk,sttrhoenfguelrl suiteofdata-drivenfeedbackledto44%strongerpasswords. AsshowninFigure4,thecoloredbaronitsownledpartici- 0% pantstocreatestrongerpasswordsthanhavingnofeedback, echoingpriorwork[42]. Thedetailedtextfeedbackweintro- 101 103 105 107 109 1011 1013 Guesses duceinthisworkledtostrongerpasswordsthanjustthebar (Figure 4). Increasing the amount of feedback also led par- Figure5: Guessabilityof1class8, medium-stringencypass- ticipantstocreatelongerpasswords(p<.001). Forexample, wordscreatedwithdifferentlevelsoffeedback. Thisgraph themedianlengthof1c8-Nonepasswordswas10characters, shows data from Experiment 1. However, we have added whereasthemedianfor1c8-Std-Mwas12characters. 1c8-NoBar-MfromExperiment2forillustrativepurposes. Notably,thesecurityof1class8passwordscreatedwithour standardmeter(includingalltextfeedback)wasmoresimilar Consideringpasswordreuseandcopy-pasting,50.6%ofpar- tothesecurityof3class12passwordscreatedwithoutfeedback ticipantstriedtorecallanovelstudypasswordfrommemory; thanto1class8passwordscreatedwithoutfeedback(Figure4). thesearetheparticipantsforwhomweexaminepasswordre- Figure 5 details the comparative security impact of all six call. Overall,98.4%oftheseparticipantssuccessfullyrecalled feedback levels on 1class8 passwords. Whereas suggested theirpasswordduringPart1,andthemajoritydidsoontheir improvementshadminimalimpact,havingtheoptiontoshow firstattempt. Intotal,78.2%ofparticipantswhoreturnedfor potentiallysensitivefeedbackprovidedsomesecuritybenefits Part 2 and who tried to recall a novel study password from overthepublic(“Pub”)variant. Whenweinvestigatedremov- memorysucceeded,againprimarilyontheirfirstattempt. ingthecoloredbarfromthestandardmeter,butleavingthe We first discuss how the three different dimensions of the textfeedback,wefoundthatremovingthecoloredbardidnot meter’sdesignimpactedsecurityandusability. Wethendetail significantlyimpactpasswordstrength. howparticipantsinteractedwith,andqualitativelyresponded For3class12passwords,however,theleveloffeedbackdid to,themeter’svariousfeatures. notsignificantlyimpactpasswordstrength. Wehypothesize that either we are observing a ceiling effect, in which the ImpactofCompositionPolicy 3class12policybyitselfledparticipantstomakesufficiently WefirstranaCoxregressionwithallthreedimensionsand strongpasswords,orthatthetextfeedbackdoesnotprovide theirpairwiseinteractionsasindependentvariables,andthe sufficientlyusefulrecommendationsfora3class12policy. 3781 Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA 3c12−Std−L 3class12passwordsweretwocharacterslonger. Priorwork testedameterthatusedonlybasicheuristics(lengthandchar- 3c12−Std−M acterclasses)toscorepasswords,withaparticularemphasis 20% on length [42]. As a result, participants could fill more of 3c12−Std−H d thestringentmeterssimplybymakingtheirpasswordlonger. e ss Incontrast,ourmeterscorespasswordsfarmorerigorously, e gu whichwehypothesizemightaccountforthisdifference. nt ce10% Althoughincreasingthescoringstringencyledparticipantsto er P createlongerpasswords,varyingbetweenmediumandhigh stringency did not affect participant sentiment or how long ittooktocreateapassword. Whenweaddedanadditional lowstringencylevelinExperiment2,however,participants 0% withmorestringentmeterstooklongertocreateapassword 101 103 105 107 109 1011 1013 1015 (p<.001)anddeletedmorecharactersduringcreation(p< Guesses .001). Theyalsotooklongertorecalltheirpasswordduring Figure6: Guessabilityof3class12passwordsbystringency. Part1(p=0.010)andwerelesslikelytotryrecallingtheir passwordsolelyfrommemory(p=.002),thoughstringency did not significantly impact other recall metrics. For high- Increasingtheamountoffeedbackincreasedthetimeittook stringencyconditions, increasedfeedbackledtoevenmore tocreatethepassword(p=.011). Weobservedaninteraction deletions(p=.002). However,for3class12policies,higher betweentheamountoffeedbackandthestringency;increasing stringencyledtofewerdeletionsoverall(p=.002). theamountoffeedbackinhigh-stringencyconditionsledtoa Increasingthestringencygreatlyimpactedparticipantsenti- greatertimeincrease(p=.048). ment. Itledparticipantstoperceivepasswordcreationasmore Tounderstandhowparticipantschangetheirpasswordsduring annoying, more difficult, and less fun (p<.001, p<.001, creation,weexaminedthenumberofdeletions,whichwede- p = .010, respectively). It also caused participants to be finedasaparticipantremovingcharactersfromtheirpassword morelikelytosaythebarhelpedthemcreateastrongerpass- thattheyaddedinapriorkeystroke. Increasingamountsof word(p=.027), lesslikelytobelievethebarwasaccurate feedbackledtosignificantlymoredeletions(p<.001), im- (p<.001),andlesslikelytofinditimportantthatthebargives plyingthatthefeedbackcausesparticipantstochangetheir themahighscore(p=.006). Increasinglevelsofstringency password-creationdecisions. Forinstance,themediannumber made participants more likely to say the text feedback led ofdeletionsfor1c8-Nonewaszero,whilethemediannumber themtocreateadifferentpasswordthantheywouldhaveoth- for1c8-Std-Hwas9. erwise(p=.010),butalsolesslikelytobelievetheylearned somethingnewfromthetextfeedback(p<.001). Increasingtheamountoffeedbacknegativelyaffectedpartici- pantsentiment. Itledparticipantstoreportpasswordcreation TextFeedback asmoreannoying(p<.001),moredifficult(p<.001),and Participants reacted positively to the text feedback. Most lessfun(p=.025). Foreachsentiment,roughly10%–15% participants(61.7%)agreedorstronglyagreedthatthetext oftheparticipantsinthatconditionmovedfromagreementto feedback made their password stronger. Similarly, 76.9% disagreement,orviceversa. disagreed or strongly disagreed that the feedback was not Eventhoughincreasingtheamountoffeedbackledtosignifi- informative, and 48.7% agreed or strongly agreed that they cantlymoresecurepasswords,itdidnotsignificantlyimpact createdadifferentpasswordthantheywouldhaveotherwise anyofourrecallmetrics. becauseofthetextfeedback. Higherstringencyparticipants were more likely to say they created a different password (p=.022),butnootherdimensionsignificantlyimpactedany ImpactofStringency otheroneofthesesentiments. Althoughpriorworkonpasswordmetersfoundthatincreased scoringstringencyledtostrongerpasswords[42],wefound Althoughmostparticipants(68.5%)selected“no”whenwe thatvaryingbetweenmediumandhighstringencydidnotsig- askediftheylearned“somethingnewaboutpasswords(your nificantlyimpactsecurity. Becausebothourmediumandhigh password,orpasswordsingeneral)fromthetextfeedback,” stringency levels were more stringent than most real-world 31.5% selected “yes.” Participants commonly said they passwordmeters,weinvestigatedanadditionallowstringency learnedaboutmovingcapitalletters, digits, andsymbolsto settinginExperiment2. Withthesethreelevels,wefoundthat lesspredictablelocationsfromthemeter(e.g.,“Ididn’tknow increasinglevelsofstringencydidleadtostrongerpasswords, itwashelpfultocapitalizeaninternalletter.”). Manypartici- butonlyfor3class12passwords(p=.043). Figure6shows pantsalsonotedthatthemeter’srequestsnottousedictionary the impact of varying the 3class12 stringency. In all cases, entriesorwordsfromWikipediaintheirpasswordtaughtthem however,increasingthestringencyledparticipantstocreate somethingnew. Oneoftheseparticipantsnotedlearning“that longerpasswords(p<.001). Forinstance,themedianhigh- hackers use Wikipedia.” Requests to include symbols also stringency 1class8 password was one character longer than resonated. Asoneparticipantwrote,“Ididn’tknowpreviously themedianlow-stringency1class8password. High-stringency thatyoucouldinputsymbolsintoyourpasswords.” 3782 Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA Table1:Asummaryofhowmovingfroma1class8to3class12 improvement,81.5%saidthatsuggestedimprovementswere policy,increasingtheamountoffeedback,orincreasingthe useful, while 18.5% said they were not. A slight majority scoringstringencyimpactedkeymetrics. (50.9%)oftheseparticipantsagreedorstronglyagreedthat thesuggestedimprovementhelpedthemmakeastrongerpass- Metric Policy Feedback Stringency word. Thishelp,however,didnotoftencomeintheformof Security adoptingtheprecisesuggestionoffered. Ineachconditionthat Passwordshardertoguess ✦ 1class8only 3class12only offeredasuggestedimprovement,atmost7%ofparticipants usedoneofthemeter’ssuggestedpasswordsverbatim. Our Passwordcreation Longerpasswords ✦ ✦ ✦ qualitativefeedbackindicatedthatthesuggestedimprovement Moretimetocreate ✦ ✦ ✦ oftensparkedotherideasformodifyingthepassword. Moredeletions – ✦ ✦ Morelikelytoshowonscreen ✦ – ✦ Weaskedparticipantswhofoundthesuggestedimprovements Lesslikelytoshowonscreen – ✦ – useful to explain why. They wrote that seeing a suggested Morelikelytoshowsuggested – – ✦ improvement“helpsyoumodifywhatyoualreadyhaveinstead improvement ofhavingtothinkofsomethingabsolutelynew”and“breaks Sentimentaboutcreation yououtofpatternsyoumighthavewhencreatingpasswords.” Moreannoying ✦ ✦ ✦ Moredifficult ✦ ✦ ✦ Participantsparticularlylikedthatthesuggestedimprovement Lessfun – ✦ ✦ wasamodificationoftheirpassword,ratherthananentirely new one, because it “may help spark ideas about tweaking Passwordrecall LessmemorableinPart1 – – – the password versus having to start from scratch.” As one Part1recalltooklonger – – ✦ participantsummmarized,“Itactuallyofferssomehelpinstead LessmemorableinPart2 ✦ – – ofjustshuttingyoudownbyessentiallysaying‘no,notgood Part2recalltooklonger – – – enough,comeupwithsomethingelse.’ It’sveryhelpful.” Requiredmoreattempts – – – Participantlesslikelytotry – – ✦ Participantswhodidnotfinditusefulexpressedtwostreamsof recallingfrommemory reasoning. Thefirstconcernedmemorability. Oneparticipant explained,“I’mmorelikelytoforgetapasswordifIdon’tuse Participantsalsotookthetextfeedbackasanopportunityfor familiartechniques.” whileanotherwrote,“Ialreadyhavea reflectionontheirpassword-creationstrategies. Onepartici- certainformatinmindwhenIcreatemypasswordtohelpme pantlearned“thatItendtousefullwordswhichisn’tgood,” memorizeitandIdon’tliketostrayfromthat.” Thesecond whileanotherlearned“don’tbasethepasswordofftheuser- streamconcernedthetrustworthinessofthe“algorithmthat name.” Participants exercised many of the features of our createsthosesuggestions.” Asoneparticipantwrote,“Idon’t feedback,includingparticipantswho“learnedtonotuseL33T trustit. Idon’twantacomputerknowingmypasswords.” tocreateapassword(exchanginglettersforpredictablenum- bers).” Someparticipantsalsolearnedaboutpasswordreuse, ModalWindows notablythat“peoplestealyourpasswordsindatabreachesand Clickinga“(why?)” linknexttoanyoftheuptothreepieces theytrytouseittoaccessotheraccounts.” offeedbackinanyconditionwithtextfeedbackopenedthe specific-advice modal. Few participants in our experiment SuggestedImprovement clickedon“(why?)” links,andthereforefewparticipantssaw Whenparticipantsinapplicableconditionsshowedtheirpass- thespecific-advicemodal. Only1.0%ofparticipantsacross wordorclicked“seeyourpasswordwithourimprovements,” allconditionsclickedononeoftheselinks. theywouldseethesuggestedimprovement.Acrossconditions, 37.8%ofparticipantsclickedthe“showpassword”checkbox. Incontrast,8.4%ofparticipantslookedatthegeneric-advice Participants who made a 3class12 password, saw a higher- modal, though 3class12 participants were less likely to do stringencymeter,orwhosawlessfeedbackweremorelikely so(p=.025). Participantscouldarriveatthegeneric-advice to show their password (p<.001, p=.006, and p=.022, modalbyclicking“howtomakestrongpasswords”orclicking respectively). Whilemostparticipantswhosawasuggested “(why?)”nexttoadmonitionsagainstpasswordreuse. Partici- improvementdidsobecausetheychecked“showpassword& pantsarrivedatitaboutevenlythroughthesetwomethods. detailedexplanations,”8.7%ofparticipantsinapplicablecon- ditionsspecificallyclickedthe“seeyourpasswordwithour ColoredBar improvements”button. Higherstringencymadeparticipants morelikelytoshowthesuggestedimprovement(p=.003). Wealsoanalyzedparticipants’reactionstothecoloredbar.All threedimensionsimpactedhowmuchofthecoloredbarpar- When we asked in the survey whether participants in those ticipantsfilled. Participantswhowereassignedthe3class12 conditionshadseenasuggestedimprovement,34.8%selected policy or saw more feedback filled more of the bar, while “yes,” 55.6%selected“no,” and9.6%chosethe“Idon’tre- participants whose passwords were rated more stringently member” option. Nearly all participants who said they did unsurprisinglyfilledless(all p<.001). Fewparticipantscom- notseeasuggestedimprovementindeedwerenevershown pletelyfilledthebar(estimatedguessnumbers1018and1024 a suggested improvement because they never showed their in medium and high stringency, respectively). The median password or clicked “see your password with our improve- participantoftenfilledhalftotwo-thirdsofthebar,depending ments.” Amongparticipantswhosaidtheysawasuggested onthestringency. Forinstance,for1c8-Std-M,only16.5% 3783 Passwords and Authentication CHI 2017, May 6–11, 2017, Denver, CO, USA completelyfilledthebar,but51.7%filledatleasttwo-thirds, sensitivefeedbackwhentheusershowshisorherpasswordon and73.1%filledatleasthalf. screen. Whilemuchofthistextfeedbackmightberedundant forpowerusers,theyarefreetoignoreit. Overall,participantsfoundthecoloredbaruseful. Themajor- ityofparticipants(64.0%)agreedorstronglyagreedthatthe Thespecificimprovementstoauser’spasswordsuggestedby coloredbarhelpedthemcreateastrongerpassword, 42.8% our meter did seem to help some participants, but the over- agreed or strongly agreed that the bar led them to make a alleffectwasnotstrongandsomeparticipantsdidnottrust differentpasswordthantheywouldhaveotherwise,and77.2% suggestions from a computer. While the inclusion of such disagreed or strongly disagreed with the statement that the suggested improvements does not seem to hurt, we would coloredbarwasnotinformative. Participantsalsolookedto consideritoptional. Similarly, althoughthegeneric-advice thecoloredbarforvalidation; 50.9%ofparticipantsagreed modalwasvisitedmorethanthespecific-advicemodal,only or strongly agreed that it is important that the colored bar afractionofparticipantslookedatit. Becausenotallusers gives their password a high score. High-stringency partici- needtolearnthebasicsofmakingstrongpasswords[41],itis pants were less likely to care about receiving a high score reasonablethatonlyahandfulofuserswouldneedtoengage (p=.025). With increasing amounts of feedback, partici- withthesefeatures. Wethusrecommendthattheybeincluded. pantsweremorelikelytocareaboutreceivingahighscore (p=.002),morelikelytosaythatthebarhelpedthemcreate Incontrasttopriorworkthatfoundscoringstringencytobe crucialforpasswordmeters[42],weonlyobservedasignifi- astrongerpassword(p<.001)thatwasdifferentthanthey cantsecurityeffectfor3class12passwords,andtheeffectsize wouldhaveotherwise(p<.001). Theywerealsolesslikely tobelievethebarwasnotinformative(p=.024). wassmall. Notethatourmeterusedfarmoreadvancedmeth- odstoscorepasswordsmoreaccuratelythanthebasicheuris- Participantsmostlyfeltthecoloredbarwasaccurate. Across tics tested in that prior work. Because the high-stringency conditions,68.2%ofparticipantsfeltthebarscoredtheirpass- settingnegativelyimpactedsomeusabilitymetrics,werecom- word’sstrengthaccurately,while23.6%feltthebargavetheir mendourmediumsetting. passwordalowerscorethandeserved. Anadditional4.2%of Ourrecommendationsdifferfor3class12passwords. Theme- participantsfeltthebargavetheirpasswordahigherscorethan terhadminimalimpactonthesecurityof3class12passwords. deserved,while4.0%didnotrememberhowthebarscored Whilethemeterintroducedfewusabilitydisadvantages,sug- theirpassword. Participantsinmorestringentconditionswere gestingthatitmaynothurttoincludethemeter,wewouldnot lesslikelytofindthebar’sratingaccurate(p<.001). recommenditnearlyasstronglyasfor1class8passwords. Wealsotestedremovingthecoloredbarwhilekeepingalltext While the studies we report in this paper are a crucial first feedback. Removingthecoloredbarcausedparticipantstobe morelikelytoreturnforPart2ofthestudy(p=.020),butdid step in pinpointing the impact of meter design dimensions andconfigurations,thenextstepistoinvestigatetheseeffects notimpactanyotherobjectivesecurityorusabilitymetrics. furtherinafieldstudy. Suchastudycouldimproveecological Removingthecoloredbardidimpactparticipantsentiment, validity,allowforexpandedtestingofpasswordmemorability, however. Participantswhodidnotseeabarfoundpassword andenablecomparisonswithcurrentlydeployedmeters. creationmoreannoyinganddifficult(both p<.001). Tospuradoptionofdata-drivenpasswordmeters,wearere- leasingourmeter’scodeopen-source.2 DISCUSSIONANDDESIGNRECOMMENDATIONS Wedescribedourdesignandevaluationofapasswordmeter thatprovidesdetailed,data-drivenfeedback. Usingacombina- ACKNOWLEDGMENTS tionofaneuralnetworkand21carefullycombinedheuristics ThisworkwassupportedinpartbyagiftfromthePNCCenter toscorepasswords,aswellasgivingusersdetailedtextexpla- forFinancialServicesInnovation,NSFgrantCNS-1012763, nationsofwhatpartsoftheirpasswordarepredictable, our andbyJohn&ClaireBertucci. metergivesusersaccurateandactionableinformationabout theirpasswords. REFERENCES 1. StevenVanAcker,DanielHausknecht,WouterJoosen, We found that our password-strength meter made 1class8 andAndreiSabelfeld.2015.Passwordmetersand passwords harder to guess without significantly impacting generatorsontheweb: Fromlarge-scaleempiricalstudy memorability. Textfeedbackledtomoresecure1class8pass- togettingitright.InProc.CODASPY. words than a colored bar alone; colored bars alone are the type of meter widely deployed today [10]. Notably, show- 2. AnneAdams,MartinaAngelaSasse,andPeterLunt. ingdetailedtextfeedbackbutremovingthecoloredbardid 1997.Makingpasswordssecureandusable.InProc.HCI notsignificantlyimpactthesecurityofthepasswordspartic- onPeopleandComputers. ipantscreated. Combinedwithourfindingthatmostpeople do not feel compelled to fill the bar, this suggests that the 3. YoavBenjaminiandYosefHochberg.1995.Controlling visualmetaphorhasonlymarginalimpactwhendetailedtext thefalsediscoveryrate: Apracticalandpowerful feedbackisalsopresent. Asaresult,wehighlyrecommend approachtomultipletesting.JournaloftheRoyal the use of a meter that provides detailed text feedback for StatisticalSociety,SeriesB57,1(1995),289–300. commonpassword-compositionpolicieslike1class8. From our results, we recommend that the meter offer potentially 2Sourcecode:https://github.com/cupslab/password_meter 3784

Description:
composition. In Proc. INTERACT. 49. Emanuel von Zezschwitz, Alexander De Luca, and. Heinrich Hussmann. 2014. Honey, I shrunk the keys: Influences of mobile devices on password composition and authentication performance. In Proc. NordiCHI. 50. Kim-Phuong L. Vu, Robert W. Proctor, Abhilasha.
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.