Malicious URL Detection using Machine Learning: A Survey

Doyen Sahoo, Chenghao Liu, and Steven C.H. Hoi

Abstract—Malicious URL, a.k.a. malicious website, is a common and serious threat to cybersecurity. Malicious URLs host unsolicited content (spam, phishing, drive-by exploits, etc.) and lure unsuspecting users into becoming victims of scams (monetary loss, theft of private information, and malware installation), causing losses of billions of dollars every year. It is imperative to detect and act on such threats in a timely manner. Traditionally, this detection is done mostly through the usage of blacklists. However, blacklists cannot be exhaustive, and lack the ability to detect newly generated malicious URLs. To improve the generality of malicious URL detectors, machine learning techniques have been explored with increasing attention in recent years. This article aims to provide a comprehensive survey and a structural understanding of Malicious URL Detection techniques using machine learning. We present the formal formulation of Malicious URL Detection as a machine learning task, and categorize and review the contributions of literature studies that address different dimensions of this problem (feature representation, algorithm design, etc.). Further, this article provides a timely and comprehensive survey for a range of different audiences: not only for machine learning researchers and engineers in academia, but also for professionals and practitioners in the cybersecurity industry, to help them understand the state of the art and facilitate their own research and practical applications. We also discuss practical issues in system design, open research challenges, and point out some important directions for future research.

Index Terms—Malicious URL Detection, Machine Learning, Online Learning, Internet Security, Cybersecurity

Doyen Sahoo is with the School of Information Systems, Singapore Management University, Singapore, email: [email protected]. Chenghao Liu was with the School of Information Systems, Singapore Management University, Singapore, email: [email protected]. Corresponding author: Steven C.H. Hoi is with the School of Information Systems, Singapore Management University, Singapore, email: [email protected]. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

I. INTRODUCTION

The advent of new communication technologies has had tremendous impact on the growth and promotion of businesses spanning many applications, including online banking, e-commerce, and social networking. In fact, in today's age it is almost mandatory to have an online presence to run a successful venture. As a result, the importance of the World Wide Web has continuously been increasing. Unfortunately, these technological advancements come coupled with new sophisticated techniques to attack and scam users. Such attacks include rogue websites that sell counterfeit goods, and financial fraud that tricks users into revealing sensitive information, eventually leading to theft of money or identity, or even the installation of malware on the user's system. There is a wide variety of techniques to implement such attacks, such as explicit hacking attempts, drive-by exploits, social engineering, phishing, watering hole, man-in-the-middle, SQL injection, loss/theft of devices, denial of service, distributed denial of service, and many others. Considering the variety of attacks, potentially new attack types, and the innumerable contexts in which such attacks can appear, it is hard to design robust systems to detect cyber-security breaches. The limitations of traditional security management technologies are becoming more and more serious given the exponential growth of new security threats, the rapid changes of new IT technologies, and the significant shortage of security professionals. Most of these attacking techniques are realized through spreading compromised URLs (or the spreading of such URLs forms a critical part of the attacking operation) [1].

URL is the abbreviation of Uniform Resource Locator, the global address of documents and other resources on the World Wide Web. A URL has two main components: (i) the protocol identifier, which indicates what protocol to use, and (ii) the resource name, which specifies the IP address or the domain name where the resource is located. The protocol identifier and the resource name are separated by a colon and two forward slashes. An example is shown in Figure 1.
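To make the two components concrete, they can be separated mechanically from the URL string; the following is a minimal sketch using Python's standard library (the example URL is hypothetical):

```python
from urllib.parse import urlparse

# Split a URL into the two main components described above:
# the protocol identifier (scheme) and the resource name (host).
url = "https://www.example.com/docs/index.html"  # illustrative example
parsed = urlparse(url)

print(parsed.scheme)    # protocol identifier, e.g. "https"
print(parsed.hostname)  # resource name: domain name or IP address
print(parsed.path)      # path to the document on that host
```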
Compromised URLs that are used for cyber attacks are termed malicious URLs. In fact, it was noted that close to one-third of all websites are potentially malicious in nature [2], demonstrating rampant use of malicious URLs to perpetrate cyber-crimes. A malicious URL, or a malicious website, hosts a variety of unsolicited content in the form of spam, phishing, or drive-by exploits in order to launch attacks. Unsuspecting users visit such websites and become victims of various types of scams, including monetary loss, theft of private information (identity, credit cards, etc.), and malware installation. Popular types of attacks using malicious URLs include Drive-by Download, Phishing and Social Engineering, and Spam [3]. Drive-by download [4] refers to the (unintentional) download of malware upon just visiting a URL. Such attacks are usually carried out by exploiting vulnerabilities in plugins or by inserting malicious code through JavaScript. Phishing and Social Engineering attacks [5] trick users into revealing private or sensitive information by pretending to be genuine web pages. Spam is the usage of unsolicited messages for the purpose of advertising or phishing. These types of attacks occur in large numbers and have caused billions of dollars worth of damage every year. Effective systems to detect such malicious URLs in a timely manner can greatly help to counter a large number and a wide variety of cyber-security threats. Consequently, researchers and practitioners have worked to design effective solutions for Malicious URL Detection.

Fig. 1. Example of a URL - "Uniform Resource Locator"

The most common method to detect malicious URLs, deployed by many antivirus groups, is the blacklist method. Blacklists are essentially a database of URLs that have been confirmed to be malicious in the past. This database is compiled over time (often through crowd-sourcing solutions, e.g. PhishTank [6]), as and when it becomes known that a URL is malicious. Such a technique is extremely fast due to its simple query overhead, and hence is very easy to implement.
Additionally, such a technique would (intuitively) have a very low false-positive rate (although it was reported that blacklisting often suffered from non-trivial false-positive rates [7]). However, it is almost impossible to maintain an exhaustive list of malicious URLs, especially since new URLs are generated every day. Attackers use creative techniques to evade blacklists and fool users by modifying the URL to "appear" legitimate via obfuscation. Garera et al. [8] identified four types of obfuscation: obfuscating the host with an IP address, obfuscating the host with another domain, obfuscating the host with large host names, and misspelling. All of these try to hide the malicious intentions of the website by masking the malicious URL. Recently, with the increasing popularity of URL shortening services, hiding a malicious URL behind a short URL has become a new and widespread obfuscation technique [9], [10]. Once the URLs appear legitimate and users visit them, an attack can be launched, often via malicious code embedded in JavaScript. The attackers will often also obfuscate the code itself so as to prevent signature-based tools from detecting it. Attackers use many other simple techniques to evade blacklists, including fast-flux, in which proxies are automatically generated to host the web-page, and algorithmic generation of new URLs. Additionally, attackers can often launch more than one attack simultaneously, which alters the attack signature, making it undetectable by tools that focus on specific signatures. Blacklisting methods thus have severe limitations, and it appears almost trivial to bypass them, especially because blacklists are useless for making predictions on new URLs.

To overcome these issues, in the last decade researchers have applied machine learning techniques for Malicious URL Detection [3], [8], [11]–[17]. Machine learning approaches use a set of URLs as training data and, based on their statistical properties, learn a prediction function to classify a URL as malicious or benign. This gives them the ability to generalize to new URLs, unlike blacklisting methods. The primary requirement for training a machine learning model is the presence of training data. In the context of malicious URL detection, this corresponds to a set of a large number of URLs. Machine learning can broadly be classified into supervised, unsupervised, and semi-supervised learning, which correspond to having labels for the training data, not having labels, and having labels for a limited fraction of the training data, respectively. Labels correspond to the knowledge that a URL is malicious or benign.

After the training data is collected, the next step is to extract informative features that sufficiently describe the URL and, at the same time, can be interpreted mathematically by machine learning models. For example, simply using the URL string directly may not allow us to learn a good prediction model (in some extreme cases this may reduce the prediction model to a blacklist method). Thus, one needs to extract suitable features based on some principles or heuristics to obtain a good feature representation of the URL. This may include lexical features (statistical properties of the URL string, bag-of-words, n-grams, etc.) and host-based features (WHOIS info, geo-location properties of the host, etc.), among others. After being extracted, these features have to be processed into a suitable format (e.g. a numerical vector), such that they can be plugged into an off-the-shelf machine learning method for model training. The ability of these features to provide relevant information is critical to subsequent machine learning, as the underlying assumption of machine learning (classification) models is that the feature representations of malicious and benign URLs have different distributions. Therefore, the quality of the feature representation of the URLs is critical to the quality of the resulting malicious URL prediction model.

Finally, using the training data with the appropriate feature representation, the next step in building the prediction model is the actual training of the model. There are plenty of classification algorithms that can be used directly over the training data (Naive Bayes, Support Vector Machine, Logistic Regression, etc.).
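As a rough illustration of this three-step process (collect labeled URLs, extract features, train an off-the-shelf classifier), the sketch below uses scikit-learn; the tiny inline dataset and the character n-gram featurization are placeholder assumptions, not a recommendation from the survey:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: URL strings with labels (+1 malicious, -1 benign).
urls = ["http://paypa1-secure-login.example/verify",
        "https://www.wikipedia.org/wiki/URL",
        "http://free-prizes.example/win?id=123",
        "https://github.com/user/repo"]
labels = [1, -1, 1, -1]

# Character n-grams of the raw URL string serve as simple lexical features.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(3, 5)),
    LogisticRegression())
model.fit(urls, labels)

print(model.predict(["http://secure-paypa1.example/login"]))
```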
However, certain properties of the URL data may make the training difficult, both in terms of scalability and in terms of learning the appropriate concept. For example, the number of URLs available for training can be in the order of millions (or even billions). As a result, the training time for traditional models may be too high to be practical. Consequently, Online Learning [18], a family of scalable learning techniques, has been heavily applied to this task. Similarly, for this task URLs are often represented using bag-of-words (BoW) features. These features basically indicate whether a particular word (or string) appears in a URL or not; as a result, every possible type of word that may appear in any URL becomes a feature. This representation may result in millions of features, which would be very sparse (most features are absent most of the time, as a URL will usually contain very few of the millions of possible words). Accordingly, a learning method should exploit this sparsity property to improve learning efficiency and efficacy. Despite the promising generalization ability of machine learning approaches, one potential shortcoming for malicious URL detection is their resource-intensive nature (especially while extracting features that are non-trivial and expensive to compute), reducing their practical value when real-time security assurance is required, compared to blacklisting methods.

In this survey, we review the state-of-the-art machine learning techniques for malicious URL detection in the literature. We specifically focus on the contributions made to feature representation and learning algorithm development in this domain. We systematically categorize the various types of feature representation used for creating the training data for this task, and also categorize the various learning algorithms used to learn a good prediction model. We also discuss the open research problems and identify directions for future research. In the rest of the survey, we first discuss the broad categories of strategies used for detecting malicious URLs: blacklists, heuristics, and machine learning. We formalize the setting as a machine learning problem, whose primary requirements are a good feature representation and a suitable learning algorithm. We then comprehensively present the various types of feature representation used for this problem. This is followed by the various algorithms that have been used to solve this task and that have been developed based on the properties of URL data. Finally, we discuss the newly emerging concept of Malicious URL Detection as a service and the principles to be used while designing such a system. We end the survey by discussing the practical issues and open problems in this domain.

II. MALICIOUS URL DETECTION

In this section, we first present the key principles used by researchers and practitioners to solve the problem of Malicious URL Detection, followed by formalizing it as a machine learning task.

A. Principles of Detecting Malicious URLs: An Overview

A variety of approaches have been attempted to tackle the problem of Malicious URL Detection. According to their fundamental principles, these approaches can be broadly grouped into two major categories: (i) Blacklisting or Heuristics, and (ii) Machine Learning approaches [19], [20]. Below we briefly describe the key principles of each category.

1) Blacklisting or Heuristic Approaches: Blacklisting approaches are a common and classical technique for detecting malicious URLs; they maintain a list of URLs that are known to be malicious. Whenever a new URL is visited, a database lookup is performed. If the URL is present in the blacklist, it is considered malicious and a warning is generated; else it is assumed to be benign. Blacklisting suffers from the inability to maintain an exhaustive list of all possible malicious URLs, as new URLs can easily be generated daily, making it impossible to detect new threats [21]. This is of particularly critical concern when attackers generate new URLs algorithmically, and can thus bypass all blacklists. Despite several problems faced by blacklisting [7], due to its simplicity and efficiency it continues to be one of the most commonly used techniques by many anti-virus systems today.
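The lookup logic itself is trivial, which is what makes blacklisting so fast; a minimal sketch (with a made-up in-memory set standing in for a maintained database such as PhishTank) also shows how a one-character change evades it:

```python
# Hypothetical in-memory blacklist; real systems query a maintained database.
BLACKLIST = {"http://malware.example/dropper.exe",
             "http://phish.example/login"}

def check_url(url: str) -> str:
    # Exact-match lookup: fast, but blind to any URL not already listed.
    return "malicious (warn user)" if url in BLACKLIST else "assumed benign"

print(check_url("http://phish.example/login"))  # malicious (warn user)
print(check_url("http://phish.example/log1n"))  # assumed benign: evaded
```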
Heuristic approaches [22] are a kind of extension of blacklist-based methods, wherein the idea is to create a "blacklist of signatures". Common attacks are identified and, based on their behaviors, a signature is assigned to each attack type. Intrusion Detection Systems can scan web pages for such signatures and raise a flag if some suspicious behavior is found. These methods have better generalization capabilities than blacklisting, as they have the ability to detect threats in new URLs as well. However, such methods can be designed for only a limited number of common threats, and cannot generalize to all types of (novel) attacks. Moreover, using obfuscation techniques, it is not difficult to bypass them.

A more specific version of heuristic approaches is the analysis of the execution dynamics of the webpage (e.g. [23]–[27]). Here also, the idea is to look for a signature of malicious activity, such as unusual process creation, repeated redirection, etc. These methods necessarily require visiting the webpage, and thus the URLs can actually launch an attack. As a result, such techniques are often implemented in a controlled environment, like a disposable virtual machine. Such techniques are very resource intensive, and require full execution of the code (including the rich client-side code). Another drawback is that websites may not launch an attack immediately after being visited, and thus may go undetected.

2) Machine Learning: These approaches try to analyze the information of a URL and its corresponding websites or webpages by extracting good feature representations of URLs, and training a prediction model on training data of both malicious and benign URLs. Two types of features can be used: static features and dynamic features. In static analysis, we analyze a webpage based on information available without executing the URL (i.e., without executing JavaScript or other code) [12], [13], [20], [28]. The features extracted include lexical features from the URL string, information about the host, and sometimes even the HTML and JavaScript content. Since no execution is required, these methods are safer than the dynamic approaches. The underlying assumption is that the distribution of these features is different for malicious and benign URLs. Using this distribution information, a prediction model can be built which can make predictions on new URLs.
Due to the relatively safer environment for extracting important information, and the ability to generalize to all types of threats (not just common ones which have to be defined by a signature), static analysis techniques have been extensively explored with machine learning. In this survey, we focus primarily on the static analysis techniques, where machine learning has found tremendous success. Dynamic analysis techniques include monitoring the behavior of systems that are potential victims, to look for any anomaly. These include [29], which monitors system call sequences for abnormal behavior, and [30], which mines internet access log data for suspicious activity. Dynamic analysis techniques have inherent risks, and are difficult to implement and generalize.

In the following, we formalize the problem of malicious URL detection as a machine learning task, which allows us to generalize most of the existing work in the literature. Alternate problem settings will be discussed in Section IV.

Fig. 2. A general processing framework for Malicious URL Detection using Machine Learning

B. Problem Formulation

We formulate the problem of malicious URL detection as a binary classification task for two-class prediction: "malicious" versus "benign". Specifically, we are given a data set with T URLs {(u_1, y_1), ..., (u_T, y_T)}, where u_t, for t = 1, ..., T, represents a URL from the training data, and y_t ∈ {1, −1} is the corresponding label: y_t = 1 represents a malicious URL and y_t = −1 a benign URL. The crux of automated malicious URL detection is two-fold:

1) Feature Representation: extracting the appropriate feature representation u_t → x_t, where x_t ∈ R^d is a d-dimensional feature vector representing the URL; and
2) Machine Learning: learning a prediction function f : R^d → R which predicts the class assignment for any URL instance x using the proper feature representation.

For a binary classification task, the goal of machine learning for malicious URL detection is to maximize the predictive accuracy, and both of the parts above are important to achieving it. While the first part, feature representation, is often based on domain knowledge and heuristics, the second part focuses on training the classification model via a data-driven optimization approach. Fig. 2 illustrates a general architecture for solving Malicious URL Detection using machine learning.

The first key step is to convert a URL u into a feature vector x, for which several types of information can be considered and different techniques can be used. Unlike learning the prediction model, this part cannot (for the most part) be directly computed by a mathematical function. Using domain knowledge and related expertise, a feature representation is constructed by crawling all relevant information about the URL. This ranges from lexical information (length of the URL, the words used in the URL, etc.) to host-based information (WHOIS info, IP address, location, etc.). Once the information is gathered, it is processed and stored in a feature vector x. Numerical features can be stored in x as is, while identity-related information or lexical features are usually stored through a binarization or bag-of-words (BoW) approach. Based on the type of information used, the vector x ∈ R^d generated from a URL is d-dimensional, where d can be less than 100 or in the order of millions. A unique challenge that affects this problem setting is that the number of features may not be fixed or known in advance. For example, using a BoW approach one can track the occurrence of every type of word that occurred in a URL in the training data; a model can be trained on this data, but at prediction time new URLs may contain words that did not occur in the training data. It is thus a challenging task to design a good feature representation that is robust to unseen data.

After obtaining the feature vector x for the training data, learning the prediction function f : R^d → R is usually formulated as an optimization problem such that the detection accuracy is maximized (or, alternately, a loss function is minimized). The function f is (usually) parameterized by a d-dimensional weight vector w, such that f(x) = w^⊤x. Let ŷ_t = sign(f(x_t)) denote the class label prediction made by the function f.
The number of mistakes made by the prediction model on the entire training data is given by Σ_{t=1}^T I_{ŷ_t ≠ y_t}, where I is an indicator which evaluates to 1 if the condition is true, and 0 otherwise. Since the indicator function is not convex, the optimization can be difficult to solve. As a result, a convex loss function ℓ(f(x), y) is often defined instead, and the entire optimization can be formulated as:

    min_w Σ_{t=1}^T ℓ(f(x_t), y_t)    (1)

Several types of loss functions can be used, including the popular hinge loss ℓ(f(x), y) = (1/2) max(1 − yf(x), 0), or the squared loss ℓ(f(x), y) = (1/2)(f(x) − y)². Sometimes a regularization term is added to prevent over-fitting or to learn sparse models, or the loss function is modified based on the cost-sensitive nature of the data (e.g., a class-imbalanced distribution, or different costs for diverse threats).

In the following, we will discuss the existing studies on feature representation for malicious URL detection and on the design of appropriate machine learning algorithms in detail.
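For concreteness, the following is a minimal subgradient-descent sketch of Eq. (1) with the hinge loss, an L2 regularizer, and a linear model f(x) = w·x; it is a toy illustration under those assumptions, not a production solver:

```python
import numpy as np

def train_hinge(X, y, epochs=100, lr=0.1, lam=0.01):
    """Minimize sum_t (1/2)max(1 - y_t f(x_t), 0) + lam*||w||^2, f(x) = w.x."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            # Subgradient of the hinge term is -(1/2) y_t x_t when the
            # margin y_t f(x_t) is below 1, and 0 otherwise.
            grad = 2 * lam * w
            if y_t * (w @ x_t) < 1:
                grad = grad - 0.5 * y_t * x_t
            w = w - lr * grad
    return w

X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w = train_hinge(X, y)
print(np.sign(X @ w))  # predicted labels y_hat = sign(f(x)): [1, 1, -1, -1]
```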
Fig. 3. Example of information about a URL that can be obtained in the Feature Collection stage

III. FEATURE REPRESENTATION

As stated earlier, the success of a machine learning model critically depends on the quality of the training data, which hinges on the quality of the feature representation. Given a URL u ∈ U, where U denotes the domain of all valid URL strings, the goal of feature representation is to find a mapping g : U → R^d, such that g(u) → x, where x ∈ R^d is a d-dimensional feature vector that can be fed into machine learning models. The process of feature representation can be broken down into two steps:

1) Feature Collection: This phase is engineering oriented, and aims to collect most if not all relevant information about the URL. This includes information such as the presence of the URL in a blacklist, direct features of the URL such as the URL string and information about the host, the content of the website such as HTML and JavaScript, popularity information, etc. Figure 3 gives an example of the various types of information that can be collected about a URL to obtain the feature representation.

2) Feature Preprocessing: In this phase, the unstructured information about the URL (e.g. textual description) is appropriately formatted and converted into a numerical vector so that it can be fed into machine learning algorithms. For example, numerical information can be used as is, while BoW is used for representing textual or lexical content. In addition, some data normalization (e.g., Z-score normalization) may be used to handle scaling issues.

For malicious URL detection, researchers have proposed several types of features that can provide useful information. We categorize these features into: Blacklist Features, URL-based Lexical Features, Host-based Features, Content-based Features, and Others (Context, Popularity, etc.). All have their benefits and shortcomings: while some are very informative, obtaining them can be very expensive. Similarly, different features have different preprocessing challenges and security concerns. Next, we discuss each of these feature categories in detail, and then compare their pros and cons.

A. Blacklist Features

As mentioned before, a trivial technique to identify malicious URLs is to use blacklists. A URL that has been identified as malicious (either through extensive analysis or crowdsourcing) makes its way into the list. However, it has been noted that blacklisting, despite its simplicity and ease of implementation, suffers from a nontrivial rate of false negatives [7] due to the difficulty of maintaining exhaustive up-to-date lists. Consequently, instead of using blacklist presence alone as the decision maker, it can be used as a powerful feature. In particular, [12] used presence in a blacklist as a feature, drawn from 6 different blacklist service providers. They also analyzed the effectiveness of these features compared to other features, and observed that blacklist features alone did not perform as well as other features, but that using them in conjunction with other features improved the overall performance of the prediction model.

[31] observed that to evade detection via blacklisting, many attackers make minor modifications to the original URL. They proposed to extend the blacklist by deriving new URLs based on five heuristics: replacing top-level domains (TLDs), IP address equivalence, directory structure similarity, query string substitution, and brand name equivalence. Since even a minor mismatch with the blacklist database can cause a malicious URL to go undetected, they also devised an approximate matching solution. Similar heuristics could potentially be used when deriving blacklist features for machine learning approaches. A similar methodology was adopted for automated URL blacklist generation by [32], [33]. [34] developed a method to proactively perform domain blacklisting.

B. Lexical Features

Lexical features are obtained from the properties of the URL name (i.e. the URL string). The motivation is that, based on how the URL "looks", it should be possible to identify its malicious nature.
For example, many obfuscation methods try to "look" like benign URLs by mimicking their names and adding minor variations. In practice, lexical features are used in conjunction with several other features (e.g. host-based features) to improve model performance. However, using the original URL name directly is not feasible from a machine learning perspective. Instead, the URL string has to be processed to extract useful features. Next, we review some of the lexical features used for malicious URL detection.

Traditional Lexical Features: The most commonly used lexical features include statistical properties of the URL string, like the length of the URL, the length of each of its components (hostname, top-level domain, primary domain, etc.), the number of special characters, and so on. [35] were among the first to suggest extracting words from the URL string. The string was processed such that each segment delimited by a special character (e.g. "/", ".", "?", "=", etc.) comprised a word. Based on all the different words in all the URLs, a dictionary was constructed, i.e., each word became a feature. If the word was present in the URL, the value of the feature would be 1, and 0 otherwise. This is known as the bag-of-words model.

Directly using the bag-of-words model causes a loss of information about the order in which words occur in the URL. [12], [28] also used similar lexical features, but distinguished between tokens belonging to the hostname, the path, the top-level domain, and the primary domain name. This was done by having a separate dictionary for each of these segments. The distinction preserves some of the order in which the words occur; for example, it allows us to distinguish between the presence of "com" in the top-level domain versus other parts of the URL. [36] try to enhance the lexical features by using bi-gram features, i.e., they construct a dictionary where, in addition to single words, the presence of a pair of words in the same URL is considered a feature. In addition, they record the positions of sensitive tokens and bigrams to exploit token context sensitivity.

The entire bag-of-words feature approach can be viewed as a machine-learning-compatible fuzzy blacklist approach. Instead of focusing on the entire URL string, it assigns scores to the URL based on smaller components of the URL string. While this approach offers an extensive number of features, it can become problematic when running sophisticated algorithms on them. For example, [28] collected a dataset of 2 million URLs with almost as many lexical features, a number that grows even larger if bi-gram features are considered. [35] considered n-gram features (same as bi-grams, but with n possibly > 2), and devised a feature selection scheme based on relative entropy to reduce the dimensionality. A similar feature extraction method was used by [37], where the feature weights were computed based on the ratio of their presence in one class of URLs against their presence in both classes.

In order to avoid being caught by blacklists, attackers can generate malicious URLs algorithmically. Using bag-of-words features for such URLs is likely to give poor performance, as algorithmically generated URLs may produce never-before-seen words (and hence never-before-seen features). To detect such algorithmically generated malicious URLs, [38] analyzed character-level strings to obtain the features. They argued that algorithmically generated domain names and those generated by humans have substantially different alpha-numeric distributions. Further, since the number of characters is small, the number of features obtained is also small. They performed their analysis based on KL-divergence, the Jaccard coefficient, and edit distance, using unigram and bigram distributions of characters.
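A minimal sketch of the delimiter-based tokenization and the segment-aware bag-of-words construction described above (in the spirit of [35] and the per-segment dictionaries of [12], [28]); the URLs and the delimiter set are illustrative assumptions:

```python
import re
from urllib.parse import urlparse

def url_tokens(url):
    """Split a URL into (segment, word) tokens. Keeping a separate
    vocabulary per segment (host/path/query) preserves some ordering
    information, as in the per-segment dictionaries of [12], [28]."""
    parts = urlparse(url)
    toks = []
    for segment, text in (("host", parts.netloc), ("path", parts.path),
                          ("query", parts.query)):
        # Words are the runs between special delimiter characters.
        for word in re.split(r"[/.?=\-_&]+", text):
            if word:
                toks.append(segment + ":" + word)
    return toks

urls = ["http://login.bank.example/secure/verify?id=1",
        "http://bank.example.phish.example/login"]
vocab = sorted({t for u in urls for t in url_tokens(u)})
# Binary bag-of-words: a feature is 1 iff the (segment, word) token occurs.
rows = [[1 if t in set(url_tokens(u)) else 0 for t in vocab] for u in urls]
print(vocab)
print(rows)
```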
Advanced Lexical Features: Traditional lexical features are obtained directly from the URL string without significant domain knowledge or computation. Researchers have proposed several advanced lexical features that exploit properties of URL strings to obtain more informative features.

[39] derive new lexical features using heuristics, with the objective of being obfuscation resistant. Based on the obfuscation types identified by [8], five categories of features are proposed: URL-related features (keywords, length, etc.), domain features (length of domain name, whether an IP address is used as the domain name, etc.), directory-related features (length of directory, number of subdirectory tokens, etc.), file name features (length of file name, number of delimiters, etc.), and argument features (length of the argument, number of variables, etc.).

Another feature is based on Kolmogorov Complexity [40]. Kolmogorov Complexity is a measure of the complexity of a string s. Conditional Kolmogorov Complexity measures the complexity of a string s given another string for free, meaning that the presence of the free string does not add to the complexity of the original input string. Based on this, for a given URL, one computes the URL's Conditional Kolmogorov Complexity with respect to the set of benign URLs and with respect to the set of malicious URLs. Combining these measures gives a sense of whether the given URL is more similar to the malicious URL database or to the benign URL database. This feature, though useful, may not be easy to scale up to a very large number of URLs. [41], [42] define the new concept of intra-URL relatedness, a measure that quantifies the relations between the different words that comprise the URL, with specific focus on the relationship between the registered domain and the rest of the URL. [43] propose new distance-based metrics, called domain brand name distance and path brand name distance. These are essentially types of edit distance between strings, aimed at detecting malicious URLs that try to mimic popular brands or websites.

C. Host-based Features

Host-based features are obtained from the host-name properties of the URL [28]. They allow us to know the location of malicious hosts, the identity of the malicious hosts, and the management style and properties of these hosts.

[11] studied the impact of a few host-based features on the maliciousness of URLs. Some of the key observations were that phishers exploited short URL services; that the time-to-live from registration of the domain was almost immediate for malicious URLs; and that many used botnets to host themselves on multiple machines across several countries. Consequently, host-based features became an important element in detecting malicious URLs.

[12], [28] borrowed ideas from [11] and proposed the usage of several host-based features including: IP address properties, WHOIS information, location, domain name properties, and connection speed. The IP address properties comprise features obtained from the IP address prefix and the autonomous system (AS) number, including whether the IPs of the A, MX or NS records are in the same ASes or prefixes as one another. The WHOIS information comprises domain name registration dates, registrars, and registrants. The location information comprises the physical geographic location, e.g. the country/city to which the IP address belongs. The domain name properties comprise time-to-live values, the presence of certain keywords like "client" and "server", whether the IP address is in the host name, and whether the PTR record resolves to one of the host's IP addresses. Since many of these features are identity-related information, a bag-of-words-like approach is required to store them in a numerical vector, where each word corresponds to a specific identity. Like the lexical features, adopting such an approach leads to a large number of features: for the 2 million URLs, [28] obtained over a million host-based features. Exclusive usage of IP address features has also been considered [44]. IP address features are arguably more stable, as it is difficult to continuously obtain new IP addresses for malicious URLs. Due to this stability, they serve as important features in malicious URL detection. However, it is cumbersome to use IP addresses directly; instead, it is proposed to extract IP address features based on a binarization or categorization approach, through which octet-based, extended-octet-based, and bit-string-based features are generated.

DNS fluxiness features were proposed to look for malicious URLs that hide their identity by using proxy networks and quickly changing their host [45], [46]. [43] define domain age and domain confidence level (dependent on similarity with a whitelist), which help determine the fluxiness of the URL (e.g. malicious URLs using fast-flux will have a small domain age). [47] propose new features to detect malicious URLs that are hidden within trusted sites: they extract header features from HTTP response headers, and also use the age obtained from the timestamp value of the last-modified header. [48] propose application-layer and network-layer features to devise a cross-layer mechanism for detecting malicious URLs. [49] suggest the usage of temporal variation patterns based on active analysis of DNS logs, to help discover domain names that could be abused in the future.
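A few of the simpler, structured host-based signals can be gathered with the standard library alone; the sketch below is a minimal illustration (the richer signals named above - WHOIS dates, AS numbers, geolocation - require external services and are omitted, and the URL is hypothetical):

```python
import socket
from urllib.parse import urlparse

def host_features(url):
    """Collect a few simple structured host-based features.
    WHOIS, AS-number and geolocation lookups would need external
    services, so only stdlib-derivable signals are sketched here."""
    host = urlparse(url).hostname or ""
    feats = {
        "host_length": len(host),
        "num_subdomains": max(host.count(".") - 1, 0),
        "host_is_ip": host.replace(".", "").isdigit(),  # crude IPv4 check
    }
    try:
        # DNS resolution; failure itself can be a useful (suspicious) signal.
        feats["resolved_ip"] = socket.gethostbyname(host)
        feats["resolves"] = True
    except socket.gaierror:
        feats["resolved_ip"] = None
        feats["resolves"] = False
    return feats

print(host_features("http://login.bank.example/secure"))
```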
D. Content-based Features

Content-based features are those obtained upon downloading the entire web-page. Compared to URL-based features, these are "heavy-weight": a lot of information needs to be extracted, and safety concerns may arise in the process. However, with more information available about a particular web-page, it is natural to assume that a better prediction model can be built. Further, if the URL-based features fail to detect a malicious URL, a more thorough analysis of the content-based features may help in the early detection of threats [19]. The content-based features of a web-page can be drawn primarily from its HTML content and its usage of JavaScript. [50] categorize the content-based features of a web-page into 5 broad segments: lexical features, HTML document level features, JavaScript features, ActiveX objects, and feature relationships. [51], [52] proposed CANTINA and its variants for detecting phishing websites using a comprehensive feature-based machine learning approach, exploiting various features from the HTML Document Object Model (DOM), search engines, and third-party services. In the following we discuss some of these categories, primarily focusing on the HTML document level features and the JavaScript features.

1) HTML Features: [50] proposed the usage of lexical features from the HTML of the web-page. These are relatively easy to extract and preprocess. At the next level of complexity, the HTML document level features can be used. The document level features correspond to the statistical properties of the HTML document and the usage of specific types of functionality. [50] propose the usage of features like: length of the document, average length of the words, word count, distinct word count, word count per line, the number of NULL characters, usage of string concatenation, unsymmetrical HTML tags, links to remote sources of scripts, and invisible objects. Often, malicious code is encrypted within the HTML, which is linked to large word lengths or heavy usage of string concatenation, and thus these features can help in detecting malicious activity. Similar features with minor variations were used by many subsequent researchers, including [46] (number of iframes, number of zero-size iframes, number of lines, number of hyperlinks, etc.). [19] also used similar features, and additionally proposed several more descriptive features aimed at finer statistical properties of the page. These include features such as the number of elements with a small area, the number of elements with suspicious content (suspiciousness was determined by the length of the content between the start and end tags), the number of out-of-place elements, the presence of double documents, etc. [53] developed a delta method, where the delta represented the change between different versions of the website; they analyzed whether the change was malicious or benign.
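A handful of these document-level statistics can be computed with simple pattern matching over the raw HTML; the following minimal sketch (regex-based, with an illustrative page snippet) mirrors a few of the features named in [50] and [46], though real extractors typically parse the DOM properly:

```python
import re

def html_features(html):
    """A few document-level statistics of the kind proposed in [50], [46]."""
    words = re.findall(r"\w+", html)
    return {
        "doc_length": len(html),
        "word_count": len(words),
        "distinct_words": len(set(words)),
        "avg_word_length": sum(map(len, words)) / max(len(words), 1),
        "num_iframes": len(re.findall(r"<iframe\b", html, re.I)),
        # Zero-size iframes are a classic hint of hidden malicious content.
        "num_zero_size_iframes": len(re.findall(
            r'<iframe[^>]*(?:width|height)\s*=\s*["\']?0', html, re.I)),
        "num_hyperlinks": len(re.findall(r"<a\b", html, re.I)),
    }

page = '<html><a href="x">ok</a><iframe width="0" height="0"></iframe></html>'
print(html_features(page))
```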
These DNSFluxinessfeatureswereproposedtolookformalicious includefeaturessuchasnumberofelementswithasmallarea, URLs that would hide their identity by using proxy networks number of elements with suspicious content (suspiciousness andquicklychangingtheirhost[45],[46].[43]definedomain was determined by the length of the content between the age and domain confidence (dependent on similarity with a start and end tag), number of out of place elements, presence white-list) level which help determine the fluxiness nature of of double documents, etc. [53] developed a delta method, theURL(e.g.maliciousURLsusingfastfluxwillhaveasmall where delta represented the change in different versions of domain age). [47] propose new features to detect malicious thewebsite.Theyanalyzedwhetherthechangewasmalicious URLsthatarehiddenwithintrustedsites.Theyextractheader or benign. features from HTTP response headers. They also use the age 2) JavaScript Features: [50] argue that several JavaScript obtainedfromthetimestampvalueofthelastmodifiedheader. functions are commonly used by hackers to encrypt malicious [48] propose Application Layer features and Network Layer code, or to execute unwanted routines without the client’s featurestodeviseacross-layermechanismtodetectmalicious permission. For example extensive usage of function eval() URLs. [49] suggest the usage of temporal variation patterns and unescape() may indicate execution of encrypted code basedonactiveanalysisofDNSlogs,tohelpdiscoverdomain within the HTML. They aim to use the count of 154 native names that could be abused in the future. JavaScript functions as features to identify malicious URLs. 8 [46] identify a subset (seven) of these native JavaScript social media platforms like twitter, where the originally long functions that are often in Cross-site scripting and Web-based URLs would not fit within the 140 character limit of a tweet. malware distribution. These include: escape(), eval(), link(), Unfortunately, this has also become a popular obfuscation unescape(), exec(), and search() functions. [19] propose ad- technique for the malicious URLs. While the Short URL ditional heuristic JavaScript features including: keywords-to- service providers try their best to not generate short URLs wordsratio,numberoflongstrings,presenceofdecodingrou- for the malicious ones, they struggle to do an effective job as tines, shell code presence probability, number of direct string they also rely primarily on blacklists [67], [68]. As a result, a assignments, number of DOM-modifying functions, number recentlyemergingresearchdirectionhasbecomeactivewhere of event attachments, number of suspicious object names, context-features of the URL are obtained, i.e., the features of number of suspicious strings, number of ”iframe” strings and the background information where the URL has been shared. number of suspicious string tags. In [54], the authors try [69] use context information derived from the tweets where to detect JavaScript Obfuscation by analyzing the JavaScript the URL was shared. [70] used click traffic data to classify codesusingn-gram,EntropyandWordSize.n-gramandword shortURLsasmaliciousornot.[71]proposeforwardingbased sizearecommonlyusedtolookforcharacter/worddistribution features to combat forwarding-based malicious URLs. [72] and presence for long strings. For Entropy of the strings, they propose another direction of features to identify malicious observe that obfuscated strings tend to have a lower entropy. 
URLs - they also focus on URLs shared on social media, and More recently, [55] applied deep learning techniques to learn aim to identify the malicious nature of a URL by performing feature representations from JavaScript code. behavioral analysis of the users who shared them, and the 3) Visual Features: There have also been attempts made users who clicked on them. These features are formally at using images of the webpages to identify the malicious called ”Posting-based” features and ”Click-based” features. nature of the URL. Most of these focus on computing visual [10] approach this problem with a systematic categorization similarity with protected pages, where the protected pages of context features which include content-related features refer to genuine websites. Finding a high level of visual (lexical and statistical properties of the tweet), context of the similarity of a suspected malicious URL could be indicative tweet features (time, relevance, and user mentions) and social of an attempt at phishing. One of the earliest attempts at features (following, followers, location, tweets, retweets and using visual features for this task was by computing the favorite count). Earth Mover’s Distance between 2 images [56]. [57], [58] Some other features used were designed heuristics to mea- addressedthesameproblemanddevelopedasystemtoextract surethepopularityoftheURL.Oneoftheearliestapproaches visual features of web pages based on text-block features to applying statistical techniques to detect malicious URLs and image-block features (using information such as block [8] aimed at probabilistically identifying the importance of size,color,etc.).Moreadvancedcomputervisiontechnologies specifichand-designedfeatures.TheseincludePage-basedfea- wereadaptedforthistask.ContrastContextHistogram(CCH) tures (Page rank, quality, etc.), Domain-based features (pres- features were suggested [59], and so were Scale Invariant enceinwhitedomaintable),Type-basedfeatures(obfuscation Feature Transform (SIFT) features [60]. Another approach types) and Word-based features(presence of keywords such as using visual feature was developed by [61], where an OCR ”confirm”, ”banking”, etc.). [73] use both the URL-based and was used to read the text in the image of the webpage. contentbasedfeatures,andadditionallyrecordtheinitialURL, [62] combine both textual and visual features for measuring the landing URL and the redirect chain. Further they record similarity. With recent advances in Deep Learning for Image thenumberofpopupsandthebehaviorofplugins,whichhave Recognition [63], [64], it may be possible to extract more been commonly used by spammers. [46] proposed the usage powerful and effective visual features. of new categories of features: Link Popularity and Network 4) Other Content-based Features: [50] argued that due to Features. Link Popularity is scored on the basis of incoming thepowerfulfunctionalityofActiveXobjects,theycanbeused links from other webpages. This information was obtained to create malicious DHTML pages. Thus, they tried to com- from different search engines. In order to make the usage pute frequency for each of eight ActiveX objects. Examples of these features robust to manipulation, they also propose include: “Scripting.FileSystemObject” which can be used for the usage of certain metrics that validate the quality of the filesystemI/Ooperations,“WScript.Shell”whichcanexecute links. They also use a metric to detect spam-to-spam URL shell scripts on the client’s computer, and “Adodb.Stream” links. 
For their work, they use these features in conjunction with lexical, content-based, and host-based features. [20] used social reputation features of URLs by tracking their public share counts on Facebook and Twitter. [74] incorporated information on redirection chains into redirection graphs, which provided insight for detecting malicious URLs.

F. Summary of Feature Representations

There is a wide variety of information that can be obtained for a URL. Crawling this information and transforming the unstructured information into a machine-learning-compatible feature vector can be very resource intensive. While extra information can improve predictive models (subject to appropriate regularization), it is often not practical to obtain a lot of features. For example, several host-based features may take a few seconds to obtain, and that alone makes using them in a real-world setting impractical. Another example is the Kolmogorov Complexity, which requires comparing a URL to several malicious and benign URLs in a database - infeasible when comparing against billions of URLs. Accordingly, care must be taken while designing a Malicious URL Detection system to trade off the usefulness of a feature against the difficulty of retrieving it. We present a subjective evaluation of the different features used in the literature. Specifically, we evaluate them on the basis of collection difficulty, associated security risks, the need for an external dependency to acquire the information, the time cost associated with feature collection and feature preprocessing, and the dimensionality of the features obtained.
Collection difficulty refers to the engineering effort required to obtain the specific information for the features. Blacklist, context, and popularity features require additional dependencies and thus have a higher collection overhead, whereas the other features are obtained directly from the URL itself. This also implies that for a live system (i.e. real-time Malicious URL Detection), obtaining features with a high collection time may be infeasible. In terms of associated security risks, the content features have the highest risk, as potential malware may be explicitly downloaded while trying to obtain these features; the other features do not suffer from this issue. The collection time of the blacklist features can be high if the external dependency has to be queried at runtime; however, if the entire blacklist can be stored locally, the collection overhead is very small. Collection of the lexical features is very efficient, as they are basically direct derivatives of the URL string. Host-based features are relatively time-consuming to obtain. Content features usually require downloading the web-page, which affects the feature collection time. For preprocessing, once the data has been collected, deriving the features is in most cases computationally very fast. Regarding dimensionality, the lexical features have a very high dimensionality (and so do unstructured host features and content features), largely because they are all stored as bag-of-words features; this feature size consequently affects the training and test time. These properties are summarized in Table I. We also categorize the representative references according to the feature representation used, in Table II.

TABLE I
PROPERTIES OF DIFFERENT FEATURE REPRESENTATIONS FOR MALICIOUS URL DETECTION

Category  | Features     | Collection Difficulty | Security Risk | External Dependency | Collection Time | Processing Time | Feature Size
----------|--------------|-----------------------|---------------|---------------------|-----------------|-----------------|-------------
Blacklist | Blacklist    | Moderate              | Low           | Yes                 | Moderate        | Low             | Low
Lexical   | Traditional  | Easy                  | Low           | No                  | Low             | Low             | Very High
Lexical   | Advanced     | Easy                  | Low           | No                  | Low             | High            | Low
Host      | Unstructured | Easy                  | Low           | No                  | High            | Low             | Very High
Host      | Structured   | Easy                  | Low           | No                  | High            | Low             | Low
Content   | HTML         | Easy                  | High          | No                  | Depends         | Low             | High
Content   | JavaScript   | Easy                  | High          | No                  | Depends         | Low             | Moderate
Content   | Visual       | Easy                  | High          | No                  | Depends         | High            | High
Content   | Other        | Easy                  | High          | No                  | Depends         | Low             | Low
Others    | Context      | Difficult             | Low           | Yes                 | High            | Low             | Low
Others    | Popularity   | Difficult             | Low           | Yes                 | High            | Low             | Low

TABLE II
REPRESENTATIVE REFERENCES OF DIFFERENT TYPES OF FEATURES USED BY RESEARCHERS IN LITERATURE

Feature   | Sub-Category     | Representative References
----------|------------------|---------------------------
Blacklist | Blacklist        | [8], [12], [31]–[34]
Lexical   | Lexical          | [8], [12], [13], [19], [20], [28], [35]–[43], [46], [48], [62], [73], [75]–[91]
Host      | Host-based       | [11]–[13], [17], [19], [28], [44]–[48], [73], [76], [77], [80], [85]
Content   | HTML             | [2], [19], [20], [22], [46], [48], [50], [73], [78], [79], [83], [84], [90], [92]–[95]
Content   | JavaScript       | [2], [4], [19], [20], [26], [48], [50], [54], [55], [73], [83], [84], [94]
Content   | Visual           | [56]–[62], [79], [96]
Content   | Others           | [50], [65], [66], [94]
Others    | Context-based    | [10], [16], [69]–[72], [97]
Others    | Popularity-based | [8], [20], [42], [46], [73], [77], [83], [84], [98]–[102]

IV. MACHINE LEARNING ALGORITHMS FOR MALICIOUS URL DETECTION

There is a rich family of machine learning algorithms in the literature that can be applied to malicious URL detection. After converting URLs into feature vectors, many of these learning algorithms can be applied to train a predictive model in a fairly straightforward manner. However, to effectively solve the problem, some efforts have also been made to devise specific learning algorithms that either exploit the properties exhibited by the training data of malicious URLs, or address specific challenges faced by the application. In this section, we categorize and review the learning algorithms that have been applied to this task, and also suggest suitable machine learning technologies that can be used to solve specific challenges encountered. We categorize the learning algorithms into: Batch Learning Algorithms, Online Algorithms, Representation Learning, and Others. Batch learning algorithms work under the assumption that the entire training data is available prior to the training task. Online learning algorithms treat the data as a stream of instances, and learn a prediction model by sequentially making predictions and updates; this makes them extremely scalable compared to batch algorithms. Next, we discuss representation learning methods, which in the context of Malicious URL Detection are largely concentrated on feature selection techniques. Lastly, we discuss other learning algorithms in which challenges specific to Malicious URL Detection are addressed, including cost-sensitive learning, active learning, similarity learning, unsupervised learning, and string pattern matching.

A. Batch Learning

Following the previous problem setting, consider a URL data set with T URLs {(u_1, y_1), ..., (u_T, y_T)}, where u_t ∈ U for t = 1, ..., T represents a URL from the training data, and y_t ∈ {1, −1} is its class label, where y_t = 1 indicates a malicious URL and y_t = −1 a benign URL. Using an appropriate feature representation scheme (g : U → R^d) as discussed in the previous section, one can map a URL instance to a d-dimensional feature vector, i.e., g(u_i) → x_i. As a result, one can apply any existing learning algorithm that works with vector-space data to train a predictive model for malicious URL detection. In this section we review the most popular batch learning algorithms that have been applied to Malicious URL Detection.

A popular family of batch learning algorithms can be categorized under a discriminative learning framework using regularized loss minimization:

    min_f Σ_{t=1}^T ℓ(f(x_t), y_t) + λR(w)    (2)

where f(x_t) can be either a linear model, e.g., f(x_t) = w·x_t + b, or some nonlinear model (kernel-based or neural networks), ℓ(f(x_t), y_t) is a loss function measuring the difference between the model's prediction f(x_t) and the true class label y_t, R(w) is a regularization term to prevent overfitting, and λ is a regularization parameter trading off model complexity and simplicity. In the following, we discuss two popular learning algorithms under this framework: Support Vector Machines and Logistic Regression.
1) Support Vector Machine: The SVM is one of the most popular supervised learning methods. It exploits the structural risk minimization principle using a maximum-margin learning approach, which can essentially be viewed as a special instance of the regularized loss minimization framework. Specifically, by choosing the hinge loss as the loss function and maximizing the margin, the SVM can be formulated as the following optimization:

    (w, b) ← arg min_{w,b} (1/T) Σ_{t=1}^T max(0, 1 − y_t(w·x_t + b)) + (λ/2)||w||²

In addition, SVMs can learn nonlinear classifiers using kernels [103]. SVMs are probably among the most commonly used classifiers for Malicious URL Detection in the literature [10], [12], [35], [40]–[43], [47], [48], [50], [69], [70], [79], [81], [92], [93].

2) Logistic Regression: This is another well-known discriminative model, which computes the conditional probability of a feature vector x being classified as class y = 1 by

    P(y = 1|x; w, b) = σ(w·x + b) = 1 / (1 + e^{−(w·x+b)})    (3)

Based on maximum-likelihood estimation (equivalently, defining the loss function as the negative log-likelihood), the optimization of logistic regression can be formulated as

    (w, b) ← arg min_{w,b} (1/T) Σ_{t=1}^T −log P(y_t|x_t; w, b) + λR(w)    (4)

where the regularization term can be either the L2-norm R(w) = ||w||_2 or the L1-norm R(w) = ||w||_1, the latter achieving a sparse model for high-dimensional data. Logistic Regression has been a popular learning method for Malicious URL Detection [8], [12], [19], [48], [50], [70], [79].
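Both models are available off the shelf; as a brief illustration (the feature matrix and its values are placeholders for real extracted features), a sketch with scikit-learn:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

# Toy feature matrix (rows: URLs, columns: extracted features) and labels
# (+1 malicious, -1 benign); the values are placeholders.
X = np.array([[25, 3, 1], [12, 1, 0], [40, 5, 1], [15, 0, 0]])
y = np.array([1, -1, 1, -1])

svm = LinearSVC(C=1.0).fit(X, y)           # hinge loss + L2 regularization
lr = LogisticRegression(C=1.0).fit(X, y)   # logistic loss + L2 regularization

x_new = np.array([[30, 4, 1]])
print(svm.predict(x_new))        # class label in {1, -1}
print(lr.predict_proba(x_new))   # [P(y = -1 | x), P(y = 1 | x)]
```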
Decision 1 (cid:88) (w,b)←arg minT −logP(yt|xt;w,b)+λR(w) (4) TreeshavebeenusedformaliciousURL/webclassificationby w,b t=1 [10], [12], [19], [22], [41], [42], [44], [48], [70], [71], [80], where the regularization term can be either L2-norm R(w)= [97]. A closely related approach which gives us rules in the ||w|| or L1-norm R(w) = ||w|| for achieving a sparse formofIf-thenwasappliedinusingAssociativeClassification 2 1 modelforhigh-dimensionaldata.LogisticRegressionhasbeen mining by [104]. a popular learning method for Malicious URL Detection [8], 5) Others and Ensembles: In addition to the above, [12], [19], [48], [50], [70], [79]. other recently proposed approaches include applying Extreme Other commonly used supervised learning algorithms fo- Learning Machines (ELM) for classifying the phishing web cus on feature-wise analysis to obtain the prediction model. sites using ELM by combining hybrid features in [105], and These include the Naive Bayes Classifier which computes thesphericalclassificationapproachthatallowsbatchlearning the posterior probability of the class label assuming feature models to be suitable for a large number of instances [106].
