By Jesse Sampson

Edit Distance Part 2

In Part 1 of this post, we looked at hunting for malicious executables with edit distance (i.e., the number of single-character edits needed to make two text strings match). Now let's look at how we can use edit distance to hunt for malicious domains, and how to build edit distance features that can be combined with other domain name features to pinpoint suspicious activity.
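As a quick illustration (using the Vertica editDistance function that appears in the Code section below), a single-character slip in a domain name yields an edit distance of 1:

--one deletion turns 'googhle.com' into 'google.com'
select editDistance('googhle.com', 'google.com');  --returns 1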

Background

What are bad actors doing with malicious domains? Sometimes they simply register a close misspelling of a common domain to trick careless users into viewing ads or picking up adware. Legitimate websites are slowly catching on to this technique, sometimes called typo-squatting.

Other malicious domains are the product of domain generation algorithms (DGAs), which can be used for all kinds of nefarious things, like evading countermeasures that block known compromised sites or overwhelming domain name servers in a distributed denial-of-service attack. Older variants use randomly generated strings, while more advanced ones add tricks like injecting common words, further confusing defenders.

Edit distance can help with both use cases: let's see how. First, we'll exclude common domain names, since these are usually safe; that same list of normal domain names also provides a baseline for detecting anomalies. One good source is Quantcast's list of top sites. For this discussion, we will stick to domain names and avoid subdomains (e.g., ziften.com, not www.ziften.com).
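As a sketch of that cleanup step, one simple way to reduce a hostname to its registered domain plus top-level domain is to keep only the last two dot-separated labels; the full query in the Code section below also handles country-code cases like .co.jp:

--simplified sketch: strip subdomains by keeping the last two labels
select regexp_substr('www.ziften.com', '[^.]+\.[^.]+$');  --returns 'ziften.com'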

After data cleaning, we compare each candidate domain name (input data observed in the wild by Ziften) to its potential neighbors in the same top-level domain (the last part of a domain name: classically .com, .org, etc., but now almost anything). The basic job is to find the nearest neighbor in terms of edit distance. By finding domains that are one step away from their nearest neighbor, we can easily spot typoed domains. By finding domains far from their neighbor (the normalized edit distance we introduced in Part 1 is useful here), we can also find anomalous domains in the edit distance space.
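The normalized edit distance used here is simply the raw edit distance divided by the length of the longer string, so it runs from 0 (identical) toward 1 (nothing in common). A minimal sketch in the same SQL dialect:

--normalized edit distance: raw edits divided by the longer string's length
select editDistance('googhle.com', 'google.com')
       / greatest(length('googhle.com'), length('google.com'));  --1/11, about 0.091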

Results

Let’s look at how these results appear in real life. Take care navigating to these domains since they could contain malicious content!

Here are a few potential typos. Typo-squatters target popular domains since there are more chances someone will visit them. Several of these are rated suspicious by our threat feed partners, but there are some false positives too, with cute names like “wikipedal”.

candidate | nearest neighbor | neighbor rank (Quantcast) | edit distance | normalized edit distance
googhle.com | google.com | 1 | 1 | 0.091
wikipedal.org | wikipedia.org | 4 | 2 | 0.154
rfacebook.com | facebook.com | 5 | 1 | 0.077
porneverest.com | pinterest.com | 13 | 4 | 0.267
pple.com | apple.com | 26 | 1 | 0.111
craigstlist.com | craigslist.com | 2309 | 1 | 0.067

Here are some weird-looking domains that are far from their nearest neighbors.

candidate | nearest neighbor | neighbor rank (Quantcast) | edit distance | normalized edit distance
a1802d421897c44eb30c53106f213f3b.ovh | tumblr.ovh | 54865 | 31 | 0.861
southafricanpostoffice.post | ems.post | 56817 | 21 | 0.778
jnqttflcg3951jnqttflcg.xyz | watchonepiece.xyz | 77500 | 18 | 0.692
pistaenjuego.ovh | tumblr.ovh | 54865 | 11 | 0.688
sjpexaylsfjnopulpgkbqtkzieizcdtslnofpkafsqweztufpa.com | newyorktimescrosswordanswers.com | 35245 | 37 | 0.685

So now we have created two useful edit distance metrics for hunting. Not only that, we have three features to potentially add to a machine learning model: the rank of the nearest neighbor, the normalized distance from that neighbor, and a flag for an edit distance of exactly 1 from the neighbor, indicating a risk of typo shenanigans (see the sketch after the code below). Other features that could play well with these include other lexical features such as word and n-gram distributions, entropy, and string length, as well as network features like the number of failed DNS requests.

Code

Here is a simplified version of the code to play with! It was developed on HP Vertica, but the SQL should work on most advanced databases. Note that the Vertica editDistance function goes by other names in other implementations (e.g., levenshtein in Postgres or UTL_MATCH.EDIT_DISTANCE in Oracle).
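For example, a rough PostgreSQL equivalent of the same edit distance call would be (levenshtein ships in the fuzzystrmatch extension):

--PostgreSQL sketch, not part of the Vertica code below
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
SELECT levenshtein('googhle.com', 'google.com');  --returns 1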


--get our list of comparison domains
--Source: Quantcast, https://www.quantcast.com/top-sites
CREATE TABLE if not exists quantcast(rank int,site varchar(200));
truncate TABLE quantcast;
COPY quantcast FROM LOCAL '/Users/jesse/Downloads/Quantcast-Top-Million.txt' delimiter '|';
delete from quantcast where site = 'Hidden profile';
--just some random websites like we see from around the world
with candidates as (
select 'dailytelangana.com' as hostname
union all select 'www.nomura-recruit.co.jp'
union all select 'geographyworldonline.com'
union all select 'www.felonyflorida.com'
union all select 'sso.kedou.com'
union all select 'www.nestacertified.com'
union all select 'www.rfacebook.com'
union all select 'a1802d421897c44eb30c53106f213f3b.ovh'
)


--top level domains used to clean
--full list available here: https://www.iana.org/domains/root/db
,iana_tlds as (
select '.ovh' as domain, 'generic' as type union all
select '.jp', 'country-code' union all
select '.com', 'generic'
)


--build regular expressions to help us match candidates to comparison domains
--don't want our .co.jp domain name to show as just co.jp
,tld as(
select domain,type,'[^.]+\'||domain||'$' regex
,'[^.]+\.[^.]{2,4}'||domain||'$' as country_regex
from iana_tlds
)


--here we use our table to properly extract domain plus top-level domain
--we only want standard top-level domain for this analysis
,t as (
select domain,case when tld.TYPE = 'country-code' then regexp_substr(trim(hostname),country_regex)
else regexp_substr(trim(hostname),regex) end as domain_name
from tld
join candidates   
on case when tld.TYPE = 'country-code' then regexp_substr(trim(hostname),country_regex)
else regexp_substr(trim(hostname),regex) end is not null
and tld.domain = regexp_substr(trim(hostname),'\.[^.]+$')
)


--clean list of candidate domains
,domain_names as(
select distinct domain,domain_name from t
left join quantcast on quantcast.site = t.domain
where
--very short names are too noisy since their edit distance space is small
length(domain_name) >= 6
--don't want to include top sites since they are likely benign
and quantcast.site is null
)


--get top domains. smaller list for demo but you can look at them all if you want ;-)
,top_domains as(select * from quantcast where length(site)>=6 and rank<=100000)


--calculate edit distance
,edit_distance as (
select domain_name
,site
,rank
,editDistance(site,domain_name) as edit_distance
,GREATEST(length(site),length(domain_name)) as norm
from domain_names
join top_domains
on domain = regexp_substr(trim(site),'\.[^.]+$')
)


--get nearest neighbors and conquer
select distinct domain_name
    ,first_value(site) over(partition by domain_name order by edit_distance/norm) as nearest
    ,first_value(rank) over(partition by domain_name order by edit_distance/norm) as nearest_rank
    ,min(edit_distance) over(partition by domain_name) as edit_distance
    ,ROUND(min(edit_distance/norm) over(partition by domain_name),3)::real as edit_distance_norm

    from edit_distance
;
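To tie this back to the feature ideas above, here is a minimal, hypothetical sketch of packaging the output for a model. It assumes the final query has been saved as a table called nearest_neighbors with the column names used above; is_possible_typo and name_length are illustrative additions, not part of the original query:

--hypothetical feature table built on the nearest-neighbor output
create table domain_features as
select domain_name
    ,nearest_rank        --rank of the nearest neighbor
    ,edit_distance_norm  --normalized distance from that neighbor
    ,case when edit_distance = 1 then 1 else 0 end as is_possible_typo  --typo risk flag
    ,length(domain_name) as name_length  --simple lexical feature
from nearest_neighbors;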

Further Reading:

“The Long ‘Taile’ of Typosquatting Domain Names”
https://www.usenix.org/system/files/conference/usenixsecurity14/sec14-paper-szurdi.pdf

“Detecting Algorithmically Generated Malicious Domain Names”
https://pdfs.semanticscholar.org/0781/d1e2c2027794393eb33cc08d60fff2940c32.pdf
