By Jesse Sampson

Edit Distance Part 1

Why do attackers keep using the same old tricks? The answer is that the old tricks keep working. For example, Cisco’s 2017 Cybersecurity Report tells us that after years of decline, spam email with malicious attachments is again on the rise. In that traditional attack vector, malware authors typically mask their activities by using a filename similar to a common system process.

There is not necessarily a connection between a file’s path name and its contents: anyone who has tried to hide sensitive information by giving it a boring name like “taxes”, or changed the extension on a file attachment to circumvent email rules, is aware of this concept. Malware authors know this too, and will often name malware to resemble common system processes. For example, “iexplore.exe” is Internet Explorer, but “iexplorer.exe” with an extra “r” could be anything. It’s easy even for experts to overlook this minor difference.

The opposite problem, known .exe files running in unusual locations, is easy to solve with a SQL join and string functions:

--example paths; rundll32.exe appears in two different directories
with some_paths as (
select 'c:\windows\syswow64\rundll32.exe' as path union all select
'c:\windows\system32\rundll32.exe' union all select
'c:\windows\system32\csrss.exe' union all select
'c:\windows\system32\notepad.exe' union all select
'c:\program files\windows nt\accessories\wordpad.exe' union all select
'c:\program files (x86)\internet explorer\iexplore.exe'
)
--split each path into its directory (path) and executable name (exe)
, a as (
select regexp_replace(path,'[^\\]+$','') as path, regexp_substr(path,'[^\\]+$') as exe
from some_paths
)
, b as (
select regexp_replace(path,'[^\\]+$','') as path, regexp_substr(path,'[^\\]+$') as exe
from some_paths
)
--same executable name seen in more than one directory
select * from a
join b
on a.exe = b.exe and a.path <> b.path
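
For readers who prefer to prototype outside the database, the same idea can be sketched in a few lines of Python. This is only an illustration, not part of the Ziften stack: the function name `exes_in_multiple_dirs` is hypothetical, and the standard-library `ntpath` module is used so that Windows paths split correctly on any operating system.

```python
import ntpath
from collections import defaultdict


def exes_in_multiple_dirs(paths):
    """Return {exe name: sorted directories} for names seen in more than one directory."""
    dirs_by_exe = defaultdict(set)
    for path in paths:
        # ntpath.split separates the directory from the final path component
        directory, exe = ntpath.split(path.lower())
        dirs_by_exe[exe].add(directory)
    return {exe: sorted(dirs) for exe, dirs in dirs_by_exe.items() if len(dirs) > 1}


paths = [
    r"c:\windows\syswow64\rundll32.exe",
    r"c:\windows\system32\rundll32.exe",
    r"c:\windows\system32\csrss.exe",
    r"c:\windows\system32\notepad.exe",
]
print(exes_in_multiple_dirs(paths))
```

With the sample paths above, only rundll32.exe is reported, because it is the only name that appears in two directories.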

What about the other case, finding near matches to the executable name? Most people begin their hunt for near string matches by sorting data and visually searching for discrepancies. This typically works well for a small set of data, maybe even a single system. Finding these patterns at scale, however, requires an algorithmic approach. One established technique for “fuzzy matching” is to use Edit Distance.
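
For readers new to the term: edit distance, in its common Levenshtein form, counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The following is a minimal Python sketch for illustration only. The function name `edit_distance` is our own, and this is not the implementation that runs inside Vertica.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping one row at a time."""
    # previous[j] = distance between the first i-1 characters of a and the first j of b
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning i characters of a into an empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


print(edit_distance("iexplore.exe", "iexplorer.exe"))  # 1: one inserted "r"
print(edit_distance("svchost.exe", "cvshost.exe"))     # 2: two substituted characters
```

Note that a swap of two characters counts as two substitutions under this definition, which is why the second example scores 2 rather than 1.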

What’s the best approach to calculating edit distance? At Ziften, our technology stack includes HP Vertica, which makes this task easy. The internet is full of data scientists and data engineers singing Vertica’s praises, so it will suffice to mention that Vertica makes it easy to create custom functions that take full advantage of its power – from C++ power tools, to statistical modeling scalpels in R and Java.

This Git repo is maintained by Vertica enthusiasts working in industry. It’s not an official offering, but the Vertica team is definitely aware of it, and moreover is thinking every day about how to make Vertica more useful for data scientists – a good space to watch. Best of all, it contains a function to calculate edit distance! There are also some other tools for natural language processing here like word stemmers and tokenizers.

By using edit distance on the top executable paths, we can quickly find the closest match to each of our top hits. This is an interesting dataset because we can sort by distance to find the closest matches over the entire dataset, or we can sort by frequency of the top path to see the closest match to each of our commonly used processes. This data can also surface on contextual “report card” pages to show, for example, the top five closest strings for a given path. Below is a toy example to give a sense of usage, based on real data ZiftenLabs observed in a customer environment.

--create example common paths
with top_paths as (
select 'c:\windows\system32\winlogon.exe' as path
union all select 'c:\windows\system32\rundll32.exe'
union all select 'c:\windows\system32\csrss.exe'
union all select 'c:\windows\system32\notepad.exe'
union all select 'c:\program files\windows nt\accessories\wordpad.exe'
union all select 'c:\program files (x86)\internet explorer\iexplore.exe'
)

--create example weird paths to compare
,comparison_paths as (
--not malware but uncommon
select 'c:\program files\nch software\wavepad\wavepad.exe' as path
union all
--suspicious path!
select 'c:\users\appdata\local\apps\2.0\...00_69190547fa2b8c15\iexplorer.exe'
)

--separate last part of path after \; need different code for non-Windows OS
,tp as (
select regexp_substr(path,'[^\\]+$') as path from top_paths
)
,cp as (
select regexp_substr(path,'[^\\]+$') as path from comparison_paths
)

--get smallest edit distance and normalized ED per common path
select distinct
tp.path as top_path,
first_value(cp.path) over w1 as comp_path,
first_value(length(tp.path)) over w1 as tp_len,
first_value(length(cp.path)) over w1 as cp_len,
MIN(public.EditDistance(tp.path, cp.path)) over w edit_distance,
MIN(ROUND(public.EditDistance(tp.path, cp.path)/GREATEST(length(tp.path),length(cp.path)),3)::real) over w norm_distance
from tp
cross join cp
window w as (PARTITION BY tp.path)
, w1 as (partition by tp.path order by public.EditDistance(tp.path, cp.path)/GREATEST(length(tp.path),length(cp.path)))
order by 1,6

In our experience, a normalized distance threshold of 0.2 finds good results, but the threshold can be adjusted to fit individual use cases. Did we find any malware? We notice that “teamviewer_.exe” (should be just “teamviewer.exe”), “iexplorer.exe” (should be “iexplore.exe”), and “cvshost.exe” (should be “svchost.exe”, unless maybe you work for CVS pharmacy…) all look strange. Since we’re already in our database, it’s also trivial to get the associated MD5 hashes, Ziften suspicion scores, and other attributes to do a deeper dive.

top_name        count  closest          norm_dist  other_MD5                         suspicious
svchost.exe     201    cvshost.exe      0.182      13df3c517c4feaade25f8f5d51ceb7dc  TRUE
teamviewer.exe  5      teamviewer_.exe  0.066      6f2d7f7e6b1b2af8c04b6ecbc8cb6aa5  FALSE
explorer.exe    161    iexplorer.exe    0.076      1c4cef04201c5d554ea72dcaa8f4bded  FALSE
iexplore.exe    157    iexplorer.exe    0.076      1c4cef04201c5d554ea72dcaa8f4bded  FALSE
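
To make the normalization and the threshold concrete, here is a small Python sketch that scores name pairs the same way the query does: edit distance divided by the length of the longer name. It is a rough stand-in for experimentation, not the Vertica function. The helper names (`normalized_distance`, `near_matches`) are hypothetical, and the 0.2 default simply mirrors the threshold mentioned above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    row = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        new_row = [i]
        for j, ch_b in enumerate(b, start=1):
            new_row.append(min(row[j] + 1,                     # deletion
                               new_row[j - 1] + 1,             # insertion
                               row[j - 1] + (ch_a != ch_b)))   # substitution or match
        row = new_row
    return row[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer string's length (0.0 for two empty strings)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def near_matches(common_names, observed_names, threshold=0.2):
    """Return (common, observed, score) for names that are close but not identical."""
    hits = []
    for common in common_names:
        for observed in observed_names:
            score = normalized_distance(common, observed)
            if 0 < score <= threshold:  # skip exact matches, keep near misses
                hits.append((common, observed, round(score, 3)))
    return sorted(hits, key=lambda hit: hit[2])


common = ["svchost.exe", "iexplore.exe", "wordpad.exe"]
observed = ["cvshost.exe", "iexplorer.exe", "wavepad.exe", "svchost.exe"]
for hit in near_matches(common, observed):
    print(hit)
```

In this sketch, wavepad.exe is not reported against wordpad.exe because three of eleven characters differ, which puts it above the 0.2 cutoff. Raising the threshold would surface it, at the cost of more noise.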

In this particular real-life environment, it turned out that teamviewer_.exe and iexplorer.exe were portable applications, not known malware. We assisted the customer with further investigation of the user and system where we observed the portable applications, since use of portable apps on a USB drive might be evidence of naughty activity. The more disturbing finding was cvshost.exe. Ziften’s intelligence feeds indicate that this is a suspicious file. Searching for the file’s MD5 hash on VirusTotal confirms the Ziften data, indicating that this is a potentially serious Trojan that could be part of a botnet or doing something even more malicious. Once the malware was found, however, it was easy to solve the problem and make sure it stays solved using Ziften’s capability to kill and persistently block processes by MD5 hash.

Even as we develop advanced predictive analytics to detect malicious patterns, it is important that we continue to improve our capabilities to hunt for known patterns and old tricks. Just because new threats emerge doesn’t mean the old ones go away!

If you liked this post, watch this space for part 2 of this series where we will apply this method to hostnames to detect malware droppers and other malicious sites.
