ebook img

Harvesting SSL Certificate Data to Identify Web-Fraud PDF

0.26 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Harvesting SSL Certificate Data to Identify Web-Fraud

Harvesting SSL Certificate Data to Identify Web-Fraud Mishari Al Mishari, Emiliano De Cristofaro, Karim El Defrawy and Gene Tsudik Information and Computer Science, University of California Irvine {malmishari,edecrist,keldefra,gts}@ics.uci.edu 9 0 0 Abstract—Web-fraud is one of the most unpleasant i.e. instructing browsers to fetch advertisements from 2 featuresoftoday’sInternet.Twoeminentexamplesofweb- a server and blending them with content of the web fraudulentactivitiesarephishingandtyposquatting.Their c site that the user intends to visit [47]. In addition to e effects range from relatively benign (such as unwanted or displayingunexpectedpages,typo-domainsoftendisplay D unexpected ads) to downright sinister (especially, when typosquatting is combined with phishing). This paper malicious, offensive and unwanted content, install mal- 5 presents a novel technique to detect web-fraud domains ware ([4], [46]) and certain typo-domains of children- 1 that utilize HTTPS. To achieve this, we conduct the first oriented web sites redirect users to adult content [10]. comprehensive studyof SSL certificates for legitimate and Worseyet,typo-domainsoffinancialwebsitescanserve ] popular domains, as opposed to those used for web-fraud. R Drawingfromextensivemeasurements,webuildaclassifier asnaturalplatformsforpassivephishingattacks1.Recent C that detects malicious domains with high accuracy. We studies have assessed the popularity of both types of validateourmethodologywithdifferentdatasetscollected maliciousactivities[44],[15].Forexample,inFall2008, . s from the Internet. Our prototype is orthogonal to existing McAfee Alert Labs found more than 80,000 domains c [ mitigation techniques and can be integrated with other typosquatting on just the top 2,000 web sites [20]. 2 atevnadileadblbeensoelfiuttsioonfs.coOnufirdewnotiraklitsyhaonwdsathuatht,enbteicsiidtye,sthitesuinse- Also, according to the Anti-Phishing Working Group, v of HTTPS can help mitigate web-fraud. the number of reported phishing attacks between April 2006 and April 2007 exceeded 300,000 [2]. 8 8 I. INTRODUCTION Nevertheless,theproblemremainsfarfromsolvedand 6 The Internet and its main application – the Web – continues to be a race between the web-fraudsters and 3 have been growing continuously in recent years. Just the security community. . 9 in the last two years the Web content has doubled in Ourgoalistocounterweb-fraudbydetectingdomains 0 size from 10 billion to over 20 billion pages, according hosting such activities. Our approach is inspired by 9 to results from Google and Yahoo [13]. Unfortunately, several recent discussions and results within the secu- 0 suchgrowthhasbeenaccompaniedbyaparallelincrease rity community. Security researchers and practitioners : v of nefarious activities, as reported by [40]. The web increasingly advocate transition to HTTPS for all web i X representsaveryappealingplatformforvarioustypesof transactions, similar to that from Telnet to SSH. Emi- r electronicfraud,ofwhichphishingandtyposquattingare nent examples of such discussions can be found in [5] a twoexamples.Phishingisthewell-knownactivityaimed and [48]. Moreover, a recent study [38] has pointed out at eliciting sensitive information – e.g., usernames, the emergence of Phishing attacks abusing SSL certifi- passwords and credit card details – from unsuspecting cates. The main goal is to avoid raising the suspision of users. It typically starts with a user being directed to users by masquerading as legitimate “secure” site. a fake web site with the look-and-feel of a legitimate, This leads us to the following questions: familiar and/or well-known web site. Consequences of 1) How prevalent is HTTPS on the Internet? phishing can range from denial-of-service to full-blown 2) How different are SSL certificates used by web- identity theft, followed by real financial losses. Only fraudsters from those of legitimate domains? in 2007, more than 3 Billion U.S. dollars were lost 3) CanweusetheinformationintheSSLcertificates to such attacks [3]. Typosquatting is the practice of to identify web-fraud activities such as phishing registering domain names that are typographical errors (or minor spelling variations) of well-known web site 1Passive phishing attacks do not rely on email/spam campaigns to addresses (target domains) [47]. It is often related to leadpeopletoaccessthefakewebsite.Instead,userswhomis-typea domain parking services and advertisement syndication, commondomainnameendupbeingdirected toaphishingwebsite. andtyposquatting,withoutcompromisinguserpri- vacy? Contributions. This paper makes several contributions. First, we measure the overall prevalence of HTTPS in popular and randomly sampled Internet domains. We then considerthe popularityof HTTPSin the contextof web-fraudbystudyingitsuseinphishingandtyposquat- ting activities. We then analyze, for the first time, all fields of SSL certificates and we identify usefulfeatures and patterns that represent possible symptoms of web- fraud. Fig.1. X.509 Certificate Format Based on the measurement findings, we propose a novel technique to identify web-fraud domains that use Data set Size HTTP HTTPS HTTP HTTPS, based on a classifier that analyzes certificates Only Only & HTTPS Alexa 100K 66% 8% 26% of web domains. We validate the classifier by training .com 100K 79% 9% 12% and testing it over data collected from the Internet. .net 100K 84% 13% 3% The classifier achieves a detection accuracy over 80% phishing 2811 70% 5% 25% and, in some cases, as high as 95%. Our classifier typosquatting 9830 95.1% 0% 4.9% is orthogonal to prior mitigation techniques and can TABLEI be integrated with other methods that do not rely on DESCRIPTIONOFDATASETSANDUTILIZATIONOF HTTP/HTTPS HTTPS to face a complete range of malicious domains. Also the integration with pre-existing solutions could improve their efficiency in the context of potentially critical applications enforcing HTTPS. Note that the e-mailaddressoraDNS-entry).Web-browsers–suchas classifieronlyreliesondataintheSSLcertificateandnot InternetExplorer,Mozilla/Firefox,Chrome and Safari – any other private user information. Finally, our findings comewithpre-installedtrustedrootcertificates.Browser serve as an additional motivation for increasing the use software makers determine which CAs are trusted third of HTTPS, not only to guarantee confidentiality and parties for the browsers’ users. The root certificates can authenticity, but also to help combat web-fraud. be manually removed or disabled, however, users rarely Paper Organization.The rest of the paper is organized doso.ThegeneralstructureofanX.509v3[6]certificate as follows. Section II introduces a brief overview of is shown in Figure 1. As discussed later in Section III, X.509 certificates. Section III presents the rationale and we analyze all fields for certificates of both legitimate detailsofourmeasurementsandanalysis.InSectionIV, and malicious domains. we describe the details of a classifier that detects mali- III. MEASUREMENTSAND ANALYSIS OFSSL ciousdomains,basedoninformationobtainedfromSSL CERTIFICATES certificates. Section V discusses the implications of our findingsandthelimitationsofourstudy.SectionVIcon- We first describe our data sets and how we collect tains an overview of related work. Finally, we conclude them. We then presentour analysis and show how these our paper in Section VII. results guide the design of a classifier that detects web- fraud domains. II. X.509CERTIFICATES A. HTTPS Usage and Certificate Harvest The term X.509certificate usually refersto an IETF’s PKIX Certificate and CRL Profile of the X.509 v3 Our measurement data was collected mainly during certificate standard,as specified in RFC 5280[6]. In this September 2009. It includes three types of domain sets: paper we are concerned with the public key certificate Legitimate,PhishingandTyposquatting.Eachdomainis portionofX.509.Intherestofthepaper,werefertothe probed twice. We probe each domain for web existence publickeycertificateformatofX.509asacertificate,or (bysendinganHTTPrequest)andforHTTPSexistence an SSL/HTTPS certificate. (by sending an HTTPS request). If a domain responds According to X.509, a certification authority (CA) to HTTPS, we harvest its SSL certificate. Below, we issues a certificate binding a public key to an X.500 describe the data sets and the state of HTTP/HTTPS DistinguishedName(ortoanAlternativeName,e.g.,an utilization in them, also summarized in Table I. 2 Legitimate and Popular Domain Data Sets. The fol- accidentally resemble typos of well-known domains. lowing data sets are considered for measurements and We identified the typosquatting domains in this set by analysis of popular and legitimate domains: detectingtheparkeddomainsamongthesetypodomains • Alexa: 100,000 most popular domains according to using the machine-learning-based classifier proposed in Alexa [1] [30] 2 . We discovered that 9,830 out of 38,617 are •.com:100,000randomsamplesof.comdomainzone parked domains. We consider these 9,830 names as the file, collected from VeriSign[29]. data set of typosquatting domains. • .net: 100,000 random samples of .net domain zone As shown in Table I, the result is that 486 ty- file, collected from VeriSign[29]. posquatting domains use HTTPS. They represent our Alexa [1] is a subsidiaryof Amazon.com knownfor typosquatting SSL certificate corpus. For convenience, its web-browser toolbar and for its web site that reports when we refer to typosquatting/phishing set, we mean traffic ranking of Internet web sites. The company does the set of typosquatting/phishing domains that have not provide information on the number of users using SSL certificates. its toolbar, but it claims several millions [14]. We use Overlapping of Sets. There is negligible overlapping Alexarankingasourmeasureofpopularityoflegitimate among data sets. The size of the domain intersection domains. We also randomly sample .com and .net to between .com and Alexa is 81 and the size of domain ensure an unbiased data-set. In the rest of the paper, we intersection between .net and Alexa is 17. There is no useAlexa,.com,and.nettodenotethreerespectivedata overlappingbetween phishing and both .com and .net. sets containing the domains or certificates harvested as The typosquatting set is a subset of the union of .com described above. and .net sets, but it only represents a very small ratio We find that 34% of Alexa domains use HTTPS; (∼1%oftheunionof.comand.netsets).Surprisingly, 26% in .com and 16% in .net. The fact that 34% of therearefourphishingdomainsthatalsobelongtoAlexa Alexa domains respond to HTTPS indicates a healthy set, but the ratio of phishing domains in Alexa set is degree of HTTPS usage among popular Internet do- extremelysmall(∼0.005%ofAlexaset).Thus,popular mains. Our explanation for lower percentage in .net is domains could (in very rare cases) host scam/phishing that, perhapsbecause .net domainsare historically non- web sites. commercial, most of them do not require user interac- Take Away. Based on the analysis of HTTPS usage in tion and do not involve sensitive data, in contrast with our data sets, we attempt to answer the first question commercial-minded .com domains. posedinSection Iandassesshowprevalentistheusage PhishingDataSet.Wecollected2,811domainsconsid- ofHTTPStoday,Wefindthatasignificantpercentageof ered to be hosting phishing scams. They were obtained popularandlegitimatedomainsalreadyuseHTTPS,and from the PhishTank web site [9] as soon as they are so do a relevant percentage of phishing domains. As a listed. We note that reported URLs in PhishTank are result, we proceedby analyzingthe differencesbetween verified by several users and, as a result, are malicious the certificates of legitimate and fraudulent web sites. with high probability. We consider this data set as a B. Certificate Analysis baselineforphishingdomains.We findthatasignificant percentage – 30% – of these phishing web sites employ The goal of this analysis is to guide the design of HTTPS.Thisrepresentsa 5%increaseoverearliermea- our detection method, the classifier. One side-benefit of surements conducted in October 2008. a similar trend the analysis is that it reveals the differences between has also been independently observed in a recent study certificates used by fraudulent and legitimate/popular bySymantec[38].Furthermore,wediscoveredanumber domains. In total, we identified 15 distinct certificate ofphishingdomainsusingHTTPS(∼10%)forwhichwe features listed in Table II.3 Most features map to actual cannot obtain corresponding certificates, due to various SSL errors (e.g., illegal certificate size). 2Typosquatters profit from traffic that accidentally comes to their domains. One common way of doing that is to host typosquatting Typosquatting Data Set. To collect SSL certificates of domains from some parking domain service. Parked domains [8] are ad-portal domains that show ads provided by a third-party service typosquatting domains we first identified the typo do- calledparkingservice,intheformofads-listing[47]sotyposquatters mains in our .com and .net data sets by using Google’s mayprofitfromincomingtraffic(e.g.,ifavisitorclicksonasponsored typo correction service [11] to identify possible typos link). in domain names. This results in 38,617 typo domains. 3Other features and fields (e.g., RSA exponent, Public Key Size, ...) had no substantial differences between legitimate and malicious However, such domains might be benign domains that domains.Thesefeatures areomitted duetospacelimitations. 3 Feature Name Type Used in Notes Classifier F1 md5 boolean Yes TheSignature Algorithm ofthecertificate is “md5WithRSAEncryption” F2 bogussubject boolean Yes Thesubjectsection ofthecertificate hasbogusvalues (e.g.,–,somestate, somecity) F3 self-signed boolean Yes Thecertificate isself-signed F4 expired boolean Yes Thecertificate isexpired F5 verification failed boolean No Thecertificate passestheverification ofOpenSSL0.9.8k25Mar2009(forDebianLinux) F6 commoncertificate boolean Yes Thecertificate ofthegiven domainisthesameasacertificate of another domain. F7 commonserial boolean Yes Theserialnumberofthecertificate isthesameas number theserialofanother one. F8 validity period boolean Yes Thevalidity periodismorethan3years >3years F9 issuercommon string Yes Thecommonnameoftheissuer name F10 issuerorganization string Yes Theorganization nameoftheissuer F11 issuercountry string Yes Thecountrynameoftheissuer F12 subjectcountry string Yes Thecountrynameofthesubject F13 exactvalidity integer No Thenumberofdaysbetweenthestarting duration dateandtheexpiration date F14 serialnumber integer Yes Thenumberofcharacters intheserial number length F15 host-common real Yes TheJaccarddistance value [24] namedistance between hostnameandcommonname inthesubjectsection TABLEII FEATURESEXTRACTEDFROMSSLCERTIFICATES certificate fields, e.g., F1 and F2. Others are computed were surprised that such a high percentageof legitimate fromfieldsinthecertificatebutarenotdirectlyreflected domains is using “md5WithRSAEncryption” as the in the fields, e.g., certificate validation failure (F5) or signature algorithm despite the well-known fact that Jaccard distance between host name and common name rogue certificates can be constructed using MD5 ([42], inthesubjectfield(F15).Notethatsomefeaturesmatch [43]) . to boolean values, whereas, others are integers, reals or F2 (bogus subject): indicates whether the subject strings. Finally, we believe that most interesting results fields have some bogus values (e.g., “ST=somestate”, correspondto: F1 (md5), F3 (self-signed), F4 (expired), “O=someorganization”, “CN=localhost”, ...). 11% of F5(verificationfailed),F6(commoncertificate)andF15 Alexa certificates (7% and 8% of .com and .net, re- (host-common name distance). spectively) satisfy this feature. This percentage is much higher(20%)inphishing (29%in typosquatting).This Analysis of Certificate Boolean Features. Features might indicate that web-fraudsters fill subject values F1–F8 have boolean values, e.g., F1 (md5) is true with bogusdata orleave defaultvalueswhen generating if the signature algorithm used in the certificate is certificates. “md5WithRSAEncryption”. The analysis and distribu- F3 (self-signed): 28% of Alexa certificates are self- tions of these features (as shown in Table III) reveal in- signed (24% and 27% of .com and .net, respectively). teresting and unexpectedissues and differencesbetween We did not expect such a high percentage among le- legitimate and malicious domains. gitimate domains. An interesting open question is to F1 (md5): 19% of Alexa certificates (24% investigate the reason(s) for all these certificates being and 22 % of .com and .net, respectively) use self-signed; however, this is out of the scope of our “md5WithRSAEncryption”. This feature has a much presentwork.The percentagesof self-signedcertificates higher ratio (35%) in phishing but only 26% in in phishing and typosquatting are 36% and 53%, typosquatting. Thehigherratioforphishingsuggestsa respectively. This is expected, since miscreants running possiblecorrelationbetweenphishingdomaincertificates suchdomainsavoidleavingatrailbyobtainingacertifi- and “md5WithRSAEncryption”signature algorithm. We catefromaCA,whichrequiressomedocumentationand 4 Feature Alexa .com .net phish typos F8 (validity period > 3 years): it seems intuitive Name (33905) (21178) (16106) (839) (486) that malicious domains would not acquire certificates F1 19% 24% 22% 35% 26% F2 11% 7% 8% 20% 29% valid for long periods of time. Thus, we analyze F3 28% 24% 27% 36% 53% the percentages of certificates with validity periods F4 21% 25% 22% 26% 43% exceeding 3 years. Alexa has 17% (14% and 17% of F5 29% 40% 37% 36% 70% .com and .net, respectively), while phishing has 13% F6 33% 60% 64% 72% 95% and typosquatting – 19%. We present further details F7 38% 64% 68% 75% 96% on the exact duration of validity periods in the analysis F8 17% 14% 17% 13% 19% of non-boolean features in Section III-B. As expected, TABLEIII the percentage of phishing certificates satisfying F8 is ANALYSISOFBOOLEANCERTIFICATEFEATURES (PERCENTAGESSATISFYINGTHEFEATURES). smaller than that for Alexa. Whereas, percentages for typosquatting and Alexa are close. Analysis of Certificate Non-Boolean Features. F9– payment.Despitehighratiosinlegitimatesets,thereisa F11 are related to the certificate issuer: common name, significant variance between the ratios in the legitimate organization name and country. F12 is the certificate sets and phishing (8-12%). This variance is even larger subject’s country. between legitimate sets and typosquatting (25-29%). F9 (issuer common name) and F10 (issuer organi- F4 (expired): another alarming result is the fraction zation): 4 some of the common-names are popular in of expired certificates, 21% of Alexa (25% and 22% of phishing but not in Alexa. For example, “Equifax” is .comand.net)aswellas26%and43%ofphishingand the most popular Common-Name for phishing (16%), typosquatting, respectively. Such high percentages of whereas it is only 6% in Alexa. (The same holds expired certificates are less surprising for phishing and for the common-name “UTN”). Also, the percentage typosquatting than for legitimate domains. There is a of certificates without common names (“JustNone”) is non-negligibledifferenceintheratiosbetweenphishing larger in Alexa (16%) than in phishing (4%). We and Alexa (5%). make similar observations for organization name. For F5 (verificationfailed): we use OpenSSL 0.9.8k[12] example, “Verisign” represents 10% of Alexa and 2.5% tovalidatecertificatesandfindthat29%ofAlexacertifi- in phishing. Also, “SomeOrganization” accounts for cates (40%and 37%of .com and .net) fail verification. 11% in phishing and 7% in Alexa. Such observations Also, 36% and 70% of phishing and typosquatting alone are not enough to discriminate phishing from certificatesfailverification.Thepercentagesin.comand popular/legitimate domains, but they can be combined .net sets are higher than in phishing maybe because with others for more effective discrimination. It is also these are random domains and mostly unpopular thus, interestingtoseethatanon-trivialpercentageofphishing having a valid certificate is not essential to these do- domains still obtain certificates from legitimate CAs. mains. F11 and F12 (issuer and subject countries): the USA (US) and South Africa (ZA) are responsible for F6(commoncertificate)andF7(commonserialnum- around 50% and 20% (respectively) of the certificates ber): F6indicateswhetherthecertificateisalsousedfor for both legitimate/popular and malicious domains. All multiple domains in our data sets. 33% of certificates other countries account only for 30%. This is expected in Alexa are duplicated (60% and 64% of .com and since several major CAs are located US and ZA. The .net, respectively). The percentage goes up to 72% granularity of countries is too coarse to derive useful for phishing and 95% for typosquatting. F7 indicates informationand the field can be easily filled with bogus whether a certificate serial number is used in another values. certificateinourdatasets.38%ofAlexacertificateshave commonserial numbers(33%were entirely duplicated). Features F13 to F15 have integer/real values cor- Also, 64% and 68% of .com and .net as well as 75% responding to: length of certificate validity period (in and 96% in phishing and typosquatting satisfy F7. days),lengthofserialnumber,andJaccardDistance[24] Certificates in phishing and typosquatting have betweenthehostnameandthecommonnameofthesub- higher duplication ratios than legitimate data sets (espe- ject. Results of the analysis are shown using cumulative ciallyAlexa)suggestingthatF7,perhapscombinedwith other features, could be helpful in identifying phishing 4We present only the result of Alexa and phishing due to space and typosquatting domains. limitations butsimilarconclusions wereobtained forotherdatasets 5 1 1 1 1 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 CDF CDF CDF CDF 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0 0 200 C e40rt0ifica(tea Va)lida t6io00n Period (in d 8a0y0s) P1h0i0As.0hcl.enoinxemagt 1200 0 0 200 C40e0rtifi(cabte V)alid 6ity0 0Period (in da 8y0s0) com+n 1e0t 0At.y0cl.epnoxoemast 1200 0 0 0.2Jaccard Distan(c 0e. a4bet)ween Host an 0d. 6Common Name 0.8PhiAs.hcl.enoinxemagt 1 0 0 0.2Jaccard Distan(c 0e.b 4betw)een Host an 0d. 6Common Nacmoem 0+.n8et At.ycl.epnoxoemast 1 1 1 Fig. 3. CDF of Jaccard Distance between Host and 0.8 0.8 Common Name in Subject Field between Alexa, .com, .net and : (a) phishing(b) typosquatting 0.6 0.6 CDF CDF 0.4 0.4 0.2 0.2 F15 (host-common name distance): we expect the 0 0 10 20 30 40PhiAs.hcl.enoinxemgat 50 0 0 10 20 30 com +4n0et At.ycl.epnoxoemast 50 common name in the subject field to be very similar Certif(icacte S)erial Number Length Certif(icadte S)erial Number Length to the hostname in legitimate certificates. We intuitively expect that the difference between common name and Fig.2. CDF of Validity Period of Alexa, .com, .net and: (a)phishing(b)typosquattingandSerialNumberLength hostnameinmaliciousdomaincertificatesisgreaterthan (c) phishing (d) typosquatting that in legitimate certificates, since malicious domains may not use complying SSL certificates. This feature captures the Jaccard Distance [24] between hostname distribution function (CDF) plots in Figures 2, and 3. and common-name. The Jaccard Distance measures the F13 (exact validity duration): Figure 2(a) compares closeness between two strings (A and B) and is defined the validity period CDF for phishing and legitimate as J(A,B) = (A and B)/(A or B). J(A,B) is basically the sets. It shows similarities among the distributions: in size of the intersection between set A and B divided Alexa, 32% are valid for 365 days and 10% for 730, by the size of their union. Thus, the larger the Jaccard as opposed to 32% for 365 days and 17% for 731 in distance between two strings, the more similar they are. phishing. phishing has a gradual increase in validity Figure 3(a) compares the CDF of Jaccard distance of periods over 731, whereas Alexa has a small jump phishing and other legitimate sets. As expected, phish- around1,095(3%).Thelongestvalidityperiodis4,096 ingcertificatestendtohavesmallerJaccarddistancethan days in phishing and 2,918,805 in Alexa. Figure 2(b) thoseinAlexa.30%inAlexahaveaJaccardDistanceof compares typosquatting with the legitimate sets. The 0 (5%of1, 4%of 0.7andseveral3-4%jumpsfrom0.6 figure clearly shows that typosquatting certificates are until0.9).Forphishing 43%haveaJaccardDistanceof validforsmallernumberofdaysthanthoseinlegitimate 0 (4%a distance of1 andthe mainjumpsexistbetween sets. This difference in distributions is larger than that 0 and 0.5 as opposed to 0.6 to 0.9 in Alexa). The of phishing. distribution of .com (.net) shows a trend differentfrom F14(serialnumberlength): agenuineSSLcertificate Alexa. We observe that .com (.net) certificates have issuedbyareputableCAismorelikelytocontainalong smaller Jaccard Distance than phishing. For example, serial number, which assures uniqueness. The CDF for 90% (93%) of certificates in .com (.net) have Jaccard serialnumberlengthforalldatasetsinFigure2(c)shows distance ≤ 0.2. Whereas, 71% in phishing have a thatphishingcertificatestendtohaveshorterserialnum- Jaccarddistanceof≤ 0.2.Thismightbeduetothefact ber length than legitimate certificates, especially, those thatdomainsin .com and.net sets are randomdomains in Alexa. Around 40% in Alexa have a serial number andmostlyunpopular(theintersectionswithAlexa’stop of length 39 (18.5% of 13, 11% of 15, 11%, 10% of 9, 100kdomainsarenegligible).Thus,havingacomplying 10%of23and6%of11).Thedistributionisdifferentfor SSL certificate is not essential to these domains. Figure phishing,where29%havelengthof39(22%of13,16% 3(b)comparestheJaccarddistanceCDFoftyposquatting of 9, 11% of 23, 10% of 11 and 7% of 15). Figure 2(d) with legitimate sets. It clearly shows that typosquatting showsthattyposquattingcertificatestendtohaveshorter certificates tend to have smaller Jaccard distance. serial number than certificates in legitimate sets. Also, Summary of Certificate Feature Analysis. The most the differencebetween typosquattingand legitimate sets important observations from our analysis are: distributions is larger than that between phishing and legitimate sets. 1) Distributions of malicious sets are significantly 6 different from those of legitimate sets for several Decision Trees: Bagging [17], [37] and Boosting [23], features. [37]. Algorithm descriptions are out of the scope; we 2) Around 20% of legitimate popular domains referto [18],[19],[26],[36],[28], [32],[32],[17],[37], are still using the signature algorithm [23], [37] for relevant background information. “md5WithRSAEncryption“ despite its clear We use precision-recall performance metrics to sum- insecurity. marize the performance of a classifier. These metrics 3) A significant percentage (> 30%) of legitimate consistsofthe followingratios:Positive Recall, Positive domain certificates are expired and/or self-signed. Precision, Negative Recall, and Negative Precision. We Several user studies (e.g. [45] and [21]) show use the term “Positive” set to denote malicious domains that an overwhelming percentage (up to 70-80% (phishingortyposquatting)and“Negative”settoreferto in some cases) of users ignore browser warning legitimate domains. The positive (negative) recall value messages. This may explain the unexpected high is the fraction of positive (negative) domains that are percentages of expired and self-signed certificates correctly identified by the classifier. Positive (negative) that we find in the results of all our data sets. precision is the actual fraction of positive (negative) Users simply ignorethose warningsand thisgives domains in the set identified by the classifier. no incentive for organizations to update and pay Weevaluatetheperformanceoftheclassifierusingthe attention to their certificates. ten-foldcrossvalidationmethod[32].Init,werandomly 4) Duplicate certificate percentages are very high in dividethe data setinto10sets ofequalsize, perform10 phishing domains, which points to the possibility different training/testing steps where each step consists that phishers use certificates only to display the oftrainingaclassifieron9setsandthentestingitonthe familiarlockiconinthebrowser,hopingthatusers remaining set. We take the average of all results as the will not pay attention. final performance result of the classifier. We use most 5) For some features, the difference in distributions of the features discussed in Section III-B to build two between malicious and legitimate sets is small. binaryclassifiers: (1) a PhishingClassifier and(2) a Ty- However, these features turn out to be useful in posquatting Classifier. The phishing/typosquatting clas- building a classifier (see Section IV). sifier distinguishes non-phishing/non-typosquatting do- 6) For most features, the difference in distributions mains from phishing/typosquatting domains using only between Alexa and malicious sets is larger than information from SSL certificates. that between .com/.net and malicious sets. A. Phishing Classifier IV. CERTIFICATE-BASEDCLASSIFIER We tried different combinations of features in the The analysis in Section III-B showed that several classifier andthefollowingyieldedthebestresults:(F1) featureshavedistributionsinlegitimatedatasetsthatare md5, (F2) bogus subject, (F3) self-signed, (F4) expired, significantlydifferentfromthoseinmalicioussets.Even (F6) common certificate, (F7) common serial number, though some show significant differences, relying on a (F8) validity period (F9) issuer common name, (F10) single feature to identify maliciousdomainswill yield a issuer organization name, (F11) issuer country, (F12) high rate of false positives. One approach, is to simply subject country, (F14) serial number length and (F15) use a set of heuristics. For example, if F1-F8 are all host-common name distance. true, the given domain is malicious with overwhelming We create two training data sets for two different probability. However, this might necessitate perform- purposes. The first consists of 840 SSL certificates half ing an exhaustive search over many combinations of of which are for phishing (positive set) and the other heuristics, in order to find the optimal one. Another half – for legitimate (negative set) domains. Legitimate possibility is to take advantage of machine-learning certificates are those of the 210 top Alexa domains,105 algorithms and let them find the best combination of random .com domains and 105 random .net domains. features. Machine learning algorithms are well suited That is, half of legitimate certificates are for popular for classification problems. We use several machine- domainsandtheotherhalfareforlegitimaterandomdo- learning-based classification algorithms and select the mains.Thepurposeoftrainingonthisdatasetistocreate best performer. Specifically, we consider the following aclassifier thatdifferentiatesbetweenphishingandnon- algorithms: Random Forest [18], Support Vector Ma- phishing domains. Table IV shows performance metric chines [19], [26], Decision Tree [36], Decision Table values for different classifiers. The Bagging classifier [28], Nearest Neighbor [32] and Neural Networks [32]. shows the highest phishing detection accuracy. Despite In addition, we explore two optimization techniques for the limited number of fields in the certificates and the 7 Classifier Positive Positive Negative Negative lutions, the larger the training data set, the better the Recall Precision Recall Precision RandomForest 0.74 0.77 0.78 0.75 performance is. We believe that our solution may incur SVM 0.68 0.75 0.78 0.71 false positives when actually deployed. However, the DecisionTree 0.70 0.79 0.81 0.73 number of false positives can be reduced by training on Bagging- 0.73 0.80 0.81 0.75 larger data sets and newer samples. DecisionTree Boosting- 0.74 0.69 0.67 0.72 We note that the use of the classifier alone can- DecisionTree not successfully detect all phishing web sites, since it DecisionTable 0.72 0.78 0.8 0.74 only works with HTTPS. However, our approach can NearestNeighbor 0.74 0.73 0.73 0.74 NeuralNetworks 0.7 0.77 0.8 0.73 be combined with other phishing mitigation techniques (e.g., [50], [27], [22]) to enhance the overall accuracy. TABLEIV PERFORMANCEOFCLASSIFIERS-DATASETCONSISTSOF:(A)420 Our resultsshow howclearly limited informationpulled PHISHINGCERTIFICATESAND(B)420NON-PHISHINGCERTIFICATES from SSL certificates (which has been overlooked be- (ALEXA,.COMAND.NET). fore) can help in distinguishing phishing and legitimate domains in the context of claimed-to-be secure naviga- tion. Classifier Positive Positive Negative Negative Recall Precision Recall Precision B. Typosquatting Classifier RandomForest 0.90 0.89 0.89 0.9 SVM 0.91 0.87 0.86 0.90 We use the same set of features as in the phishing DecisionTree 0.86 0.89 0.90 0.87 classifier and also create two training data sets: the first Bagging- 0.90 0.88 0.87 0.9 with certificates of 486 domains from the typosquatting DecisionTree Boosting- 0.89 0.81 0.80 0.88 (positive) data set (see Section III-A) and 486 non- DecisionTree typosquatting (negative) domains. Half of the negatives DecisionTable 0.90 0.84 0.82 0.90 arepickedrandomlyfromAlexaandtherestarerandom NearestNeighbor 0.87 0.86 0.86 0.87 NeuralNetworks 0.87 0.89 0.90 0.87 .comand.net domains(distributedequallybetweenthe two).Thistrainingsettrainstheclassifiertodifferentiate TABLEV PERFORMANCEOFCLASSIFIERS-DATASETCONSISTSOF:(A)420 between typosquatting and non-typosquatting domains. PHISHINGDOMAINCERTIFICATESAND(B)420POPULARDOMAIN Performanceof variousclassifiers is shown in Table VI. CERTIFICATES(ALEXA) Themetricsarefairlysimilaracrossclassifiers,however, Decision Tree offers the highest accuracy with a rate of 93%. Theseconddatasetconsistsofthesametyposquatting dependenciesamongthem,wecanbuildaclassifierwith certificates in the previous data set and certificates of 80% phishing detection accuracy. 486 Alexa top domains. This set is needed to train The second training data set consists of 840 SSL the classifier to differentiate, typosquatting and popular certificates. Half of these are for phishing domains domains. Performance results are shown in Table VII. (positiveset)andtherest–legitimatecertificatesfor420 The performance of many classifiers improves because top Alexa’s domains (negative set). The purpose is to thedifferenceindistributionsoffeaturesbetweenAlexa train the classifier to differentiatephishingfrom popular and typosquatting data sets is larger than the difference domains. Performance results of different classifiers are between .com (.net) and typosquatting data sets. The showninTableV.Theresultsimprovesignificantlyfrom Neural Network classifier shows the highest typosquat- thoseinthefirsttrainingdataset.RandomForest(along ting detection accuracy, with a rate of 95%. with few others) exhibits the best phishing detection accuracy with a rate of 89%. The reason is that (as V. DISCUSSION shown in Section III-B) the difference in distributions Based on the measurementspresented in the previous of many features between Alexa and phishing is much sections, we find that a significant percentage of well- largerthanthatbetween.com(.net)andphishing.Thus, known domains already use HTTPS (alongside HTTP), the random .com and .net certificates are polluting the thus it is possible to harvest their certificates for clas- non-phishing set and making it harder to differentiate. sifying them, without requiring any modifications from Although the number of certificate fields is small, we the domains’ side. Furthermore, the non-trivial portion canreliablydifferentiatephishingfrompopulardomains of phishing web sites using HTTPS highlights the need with high accuracy. toanalyzeandcorrelatetheinformationprovidedintheir Similar to machine-learning-based spam filtering so- server certificates. 8 Classifier Positive Positive Negative Negative it contributes to face a complete range of malicious Recall Precision Recall Precision RandomForest 0.86 0.88 0.88 0.87 domains. SVM 0.86 0.88 0.89 0.86 Finally, we note that our classifier does not identify DecisionTree 0.84 0.93 0.93 0.85 all kinds of typosquatting domains. For instance, typo- Bagging- 0.87 0.90 0.90 0.87 DecisionTree domains displaying unexpected pages and ads might Boosting- 0.89 0.85 0.84 0.88 have no incentive to use HTTPS, since they do not DecisionTree attempt to elicit private information from users. How- DecisionTable 0.86 0.87 0.87 0.86 NearestNeighbor 0.86 0.84 0.84 0.86 ever, in the case of a further increase of HTTPS usage, NeuralNetworks 0.84 0.89 0.90 0.85 our technique will be as a consequence increasing its TABLEVI effectiveness.Generictyposquattingdetectionremainsat PERFORMANCEOFCLASSIFIERS-DATASETCONSISTSOF:(A)486 the momentan interesting openissue thathas notfound TYPOSQUATTINGDOMAINCERTIFICATESAND(B)486 effective solutions so far. NON-TYPOSQUATTINGDOMAINCERTIFICATES(TOPALEXA’S DOMAINCERTIFICATES,.COMAND.NETRANDOMCERTIFICATES) Using information in certificates. Our results show significant differences between certificates of legitimate domainsandthoseofmaliciousdomains.Notonlyisthis Classifier Positive Positive Negative Negative informationalonesufficienttodetectfraudulentactivities Recall Precision Recall Precision RandomForest 0.95 0.93 0.93 0.95 as we have shown, but it is also a useful component in SVM 0.94 0.92 0.92 0.94 assessing a web site’s degree of trustworthiness, thus DecisionTree 0.95 0.94 0.93 0.95 improving prior metrics, such as [50], [27], [22]. Our Bagging- 0.96 0.94 0.93 0.96 methodshouldbeintegratedwithothertechniquestoim- DecisionTree Boosting- 0.98 0.90 0.90 0.97 prove the effectiveness of detecting malicious domains. DecisionTree Keeping state of encountered certificates. We deliber- DecisionTable 0.96 0.92 0.92 0.96 NearestNeighbor 0.93 0.94 0.94 0.92 ately chose to conduct our measurements as general as NeuralNetworks 0.95 0.95 0.95 0.95 possible,withoutrelyingonusernavigationhistoryoron TABLEVII userspecifictrainingdata.Thesecomponentsarefunda- PERFORMANCEOFCLASSIFIERS-DATASETCONSISTSOF:(A)486 mentalformostcurrentmitigationtechniques[22],[35]. TYPOSQUATTINGDOMAINCERTIFICATESAND(B)486POPULAR Moreover, we believe that keeping track of navigation DOMAINCERTIFICATES(TOPALEXA’SDOMAINCERTIFICATES) historyisdetrimentaltouserprivacy.However,ourwork yields effective detection by analyzing certain coarse- grained information extracted from server certificates Limitations. We acknowledge that our data sets should and not specific to a user’s navigation patterns. This beextendedthroughlongerobservationperiods.Further- is not privacy compromising as keeping fine-grained more, additional data sets of legitimate domains need navigation history. to be taken into consideration, e.g. popular web sites How malicious domains will adapt. Web fraud detec- from DNS logs in different organizationsand countries. tion is an arms race and Web-fraudstersare diligentand Data sets of typosquatting domains can be strengthened quickly adapt to new security mechanisms and studies by additional and more effective name variations. Also, that threaten their business. We hope that this work we acknowledge that our phishing(typosquatting) clas- will raise the bar and make it more difficult for Web- sifier may incur false positives when actually deployed. fraudsters to deceive users. If web-browsers use our However, this is a common problem to many machine- classifier and start analyzing SSL certificate fields in learning-based mitigation solutions (spam filtering and more detail, we expect there will be two scenarios for intrusion detections based on machine-learning algo- web-fraudstersto adapt:(1) to get legitimate certificates rithms) and the number of false positives can be min- from CAs and leave a paper trail pointing to ”some imized by training the classifier on larger and more person or organization” which is connected to such comprehensive data sets (and continuing to do so for an activity, (2) to craft certificates that have similar new samples). Our classifier does not provide a com- values like those that are the most common in those plete standalone solution to the phishing(typosquatting) of legitimate domains. Some fields/featureswill be easy threatsincemanydomainsdonothaveHTTPS. Instead, to forge with legitimate values (e.g., country of issuer, integrated with pre-existing solutions (e.g., [50], [27], country of subject, subject common and organization [22]), it improves their effectiveness in the context of name,validityperiod,signaturealgorithm,serialnumber potentially critical applications enforcing HTTPS and ...etc)butforsomethiswillnotbepossible(issuername, 9 signature...etc)becauseotherwisethe verificationofthe the infrastructure used for hosting and supporting Inter- certificate will fail. In either case the effectiveness of net scams, including phishing. It used an opportunistic web-fraud will be reduced. measurement technique that mined email messages in real-time, followed the embedded link structure, and VI. RELATEDWORK automaticallyclustereddestinationwebsitesusingimage Theauthorsin[25]conductedameasurementstudyto shingling. In [22], a machine learning based method- measure the cryptographicstrength of more than 19,000 ology was proposed for detecting phishing email. The public servers running SSL/TLS. The study reported methodology was based on a classifier that detected that many SSL sites still supported the weak SSL 2.0 phishing with 96% accuracy and false negative rate protocol. Also, most of the probed servers supported of 0.1%. Our work differs since it does not rely on DES, which is vulnerable to exhaustive search. Some phishing emails which are sometimes hard to identify. of the sites used RSA-based authentication with only An anti-phishing browser extension (AntiPhish) was 512-bit key size, which is insufficient. Nevertheless, the given in [27]. It kept track of sensitive information authors showed encouraging measurement results such and warned the user whenever the user tried to en- as the use of AES as default option for most of the ter sensitive information into untrusted web sites. Our servers that did not support AES. Also, a trend toward classifier can be easily integrated with AntiPhish. How- usingstrongercryptographicfunctionhasbeenobserved ever, AntiPhish compromised user privacy by keeping over two years of probing, despite a slow improvement. state of sensitive data. Other anti-phishing proposals In[39],theauthorsperformedasix-monthmeasurement relied on trusted devices, such as a user’s cell phone study on the aftermath of the discovered vulnerability in [35]. In [31], the authors tackled the problem of in OpenSSL in May 2008, in order to measure how detecting malicious web sites by only analyzing their fast the hosts recovered from the bug and changed their URLs using machine-learning statistical techniques on weak keys into strong ones. Through the probing of the lexical and host-based features of their URLs. The thousands of servers, they showed that the replacement proposedsolutionachievedapredictionaccuracyaround of the weak keys was slow. Also, the speed of recovery 95%.Otherstudiesmeasuredtheextentoftyposquatting was shown to be correlated to different SSL certificate and suggested mitigation techniques. Wang, ed al. [47] characteristics such as: CA type, expiry time, and key showed that many typosquatting domains were active size. The article in [33] presented a profiling study of and parked with a few parking services, which served a set of SSL certificates. Probing around 9,754 SSL adsonthem.Similarly,[16]showedthat,fornearly57% servers and collecting 8,081 SSL certificates, it found of original URLs considered, over 35% of all possible thataround30%ofrespondingservershadweaksecurity URLvariationsexistedontheInternet.Surprisingly,over (small key size, supporting only SSL 2.0,...), 10% of 99%ofsuch similarly-namedwebsites wereconsidered them were already expired and 3% were self-signed. phony. [30] devised a methodology for identifying ads- Netcraft [34] conducts a monthly survey to measure the portals and parked domains and found out that around certificatevalidityofInternetservers.Recently,thestudy 25% of (two-level) .com and .net domains were ads- showed that 25% of the sites had their certificates self- portals and around 40% of those were typosquatting. signed and less than half had their certificates signed by McAfee also studied the typosquatting problem in [41]. validCA.Symantec[38]hasobservedanincreaseinthe Asetof1.9millionsingle-errortyposwasgeneratedand number of URLs abusing SSL certificates. Only in the 127,381 suspected typosquatting domains were discov- periodbetweenMayandJune 2009,26%ofall theSSL ered. Alarmingly, the study also found that typosquat- certificate attacks have been performed by fraudulent terstargetedchildren-orienteddomains.Finally,McAfee domains using SSL certificates. Although our measure- added to its extension site advisor [7] some capabilities ment study conducts a profiling of SSL certificates, our for identifying typosquatters. purpose is different from the ones above. We analyze the certificates to show how malicious certificates are User studies analyzing the effectiveness of browser differentfrombenignonesandtoleveragethisdifference warning messages indicated that an overwhelming per- in designing a mitigation technique. centage (up to 70-80% in some cases) of users ignored The importance and danger of web-fraud (such as them. This observation—confirmed by recent research phishing and typosquatting) has been recognized in results (e.g., [45] and [21])— might explain the un- numerouspriorpublications,studiesandindustryreports expected high percentage of expired and self signed mainly due to the tremendous financial losses [3] that certificates that we found in the results of all our data it causes. One notable study is [15] which analyzed sets. In [49], the authorsproposeda solutionto simplify 10

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.