Writing Regular Expressions | Internet Forensics

Let's look at some simple regular expression examples:

 PS C:\> "SAPIEN Press" -match "\w"
 True
 PS C:\> $matches
 Name                           Value
 ----                           -----
 0                              S
 PS C:\> "SAPIEN Press" -match "\w*"
 True
 PS C:\> $matches
 Name                           Value
 ----                           -----
 0                              SAPIEN
 PS C:\> "SAPIEN Press" -match "\w+"
 True
 PS C:\> $matches
 Name                           Value
 ----                           -----
 0                              SAPIEN
 PS C:\> "SAPIEN Press" -match "\w* \w*"
 True
 PS C:\> $matches
 Name                           Value
 ----                           -----
 0                              SAPIEN Press
 PS C:\>

The first example compares the string "SAPIEN Press" to the pattern \w. Recall that comparison results are automatically stored in the $matches variable. As you can see, \w matches "S". Why doesn't it match "SAPIEN" or "SAPIEN Press"? The \w pattern matches a single word character (a letter, digit, or underscore), not a whole word. If we want to match a complete word, then we need to use one of the regular expression quantifiers. For example, the second and third examples use \w* (zero or more word characters) and \w+ (one or more word characters) respectively. In this particular example, these patterns return the same results.
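The two quantifiers only behave the same when the string starts with a word character. The following is a minimal sketch of where they differ; the sample strings " Press" and "" are made up for illustration and do not come from the examples above.

```powershell
# \w matches exactly one word character (letter, digit, or underscore).
"SAPIEN Press" -match "\w" | Out-Null
$single = $matches[0]            # a single character

# \w+ needs at least one word character, so it captures the first whole word.
"SAPIEN Press" -match "\w+" | Out-Null
$word = $matches[0]

# With a leading space, \w* is satisfied by zero characters at position 0,
# so the match succeeds but the matched text is empty.
" Press" -match "\w*" | Out-Null
$starMatch = $matches[0]

# \w+ skips the space and captures the word that follows it.
" Press" -match "\w+" | Out-Null
$plusMatch = $matches[0]

# On an empty string, \w* still succeeds while \w+ fails.
$starOnEmpty = "" -match "\w*"
$plusOnEmpty = "" -match "\w+"
```

In practice this means \w+ is usually the safer choice when a word is required, because \w* reports a match even when no word is present.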

If we want to match a two-word phrase, then we would use an example like the last one using \w* \w*. If we test the same pattern against a three-word string such as "SAPIEN Press PowerShell", we get this:

 PS C:\> "SAPIEN Press Powershell" -match "\w* \w*"
 True
 PS C:\> $matches
 Name                           Value
 ----                           -----
 0                              SAPIEN Press
 PS C:\>

This also matches, but as you can see it only matches the first two words. If we want to specifically match a two-word pattern, then we need to anchor our regular expression so that it must match the entire string, from the first word to the last:

 PS C:\> "SAPIEN Press Powershell" -match "^\w* \w*$"
 False
 PS C:\> "SAPIEN Press" -match "^\w* \w*$"
 True
 PS C:\>

The recommended best practice for strict regular expression evaluation is to use the ^ and $ anchors to denote the beginning and end of the matching pattern.
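One way to package an anchored pattern is a small test function. The sketch below is illustrative only: the function name Test-TwoWordPhrase is hypothetical, and it uses \w+ rather than \w* on the assumption that each word should contain at least one character.

```powershell
# Hypothetical helper: returns $true only when the entire string consists of
# two words separated by a single space.
function Test-TwoWordPhrase {
    param([string]$Text)
    # ^ anchors the match at the start of the string and $ at the end.
    # \w+ requires at least one character per word.
    return $Text -match '^\w+ \w+$'
}

$twoWords   = Test-TwoWordPhrase "SAPIEN Press"
$threeWords = Test-TwoWordPhrase "SAPIEN Press PowerShell"
$onlySpace  = Test-TwoWordPhrase " "

# For comparison, the \w* version accepts a lone space, because both
# words are allowed to be empty.
$onlySpaceWithStar = " " -match '^\w* \w*$'
```

Here $twoWords is True, while $threeWords and $onlySpace are False. The last line shows why the choice of quantifier still matters once the pattern is anchored.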

Here's another example:

 PS C:\> "1001" -match "\d"
 True
 PS C:\> $matches
 Name                           Value
 ----                           -----
 0                              1
 PS C:\> "1001" -match "\d+"
 True
 PS C:\> $matches
 Name                           Value
 ----                           -----
 0                              1001
 PS C:\>

In the first example, the digit matching pattern is used to get a TRUE result. However, $matches shows it only matched the first digit. Using \d+ in the second example returns the full value.

If we want the number to be four digits, then we can use a quantifier like this:

 PS C:\> "1001" -match "\d{4,4}"
 True
 PS C:\> "101" -match "\d{4,4}"
 False
 PS C:\>

The quantifier {4,4} sets both the minimum and the maximum number of repetitions to four, so \d{4,4} matches exactly four consecutive digits (\d). When we use the regular expression to evaluate 101, it returns FALSE because the string contains only three digits.
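Note that the quantifier limits the length of the match, not the length of the string. The sketch below illustrates this with a made-up five-digit value, "10011", and uses {4}, which is equivalent shorthand for {4,4}.

```powershell
# {4} is shorthand for {4,4}: exactly four repetitions of the preceding element.
$fourDigits  = "1001" -match '\d{4}'
$threeDigits = "101"  -match '\d{4}'

# Without anchors, a five-digit string still contains a run of four digits,
# so the pattern matches the first four of them.
$unanchored = "10011" -match '\d{4}'
$matchedText = $matches[0]

# Adding ^ and $ rejects anything other than a string of exactly four digits.
$anchored = "10011" -match '^\d{4}$'
```

With these inputs, $fourDigits and $unanchored are True, $threeDigits and $anchored are False, and $matchedText holds "1001".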

The following example shows a regular expression that is evaluating a UNC path string:

 PS C:\> "\\file01\public" -match "^\\\\\w*\\\w*$"
 True
 PS C:\> $matches
 Name                          Value
 ----                          -----
 0                             \\file01\public
 PS C:\>

This example looks a little confusing, so let's break it apart. First, we want an exact match, so we're using ^ and $ to denote the beginning and end of the regular expression. We know the server name and share name will be alphanumeric words, so we can use \w. Because we want the entire word, we'll use the * quantifier. All that's left are the \\ and \ characters in the UNC path. Remember that \ is a special character in regular expressions. If we want to match the \ character itself, then we need to "escape" it using another \ character. In other words, each \ will become \\. So the elements of the regular expression break down to:

 ^      (beginning of expression)
 \\\\   (the leading \\)
 \w*    (servername)
 \\     (the \ separator)
 \w*    (sharename)
 $      (end of expression)

Putting this all together, we end up with ^\\\\\w*\\\w*$. As you can see in the example, this pattern matches the full UNC path.
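Counting backslashes by hand is error-prone. As an alternative sketch, the .NET [regex]::Escape method can produce the escaped form of the literal pieces, and the pattern can then be assembled from its parts. The variable names below are illustrative, not part of the original example.

```powershell
# Single-quoted PowerShell strings keep backslashes exactly as typed.
$leading   = [regex]::Escape('\\')   # two literal backslashes become four
$separator = [regex]::Escape('\')    # one literal backslash becomes two

# Assemble the pattern from the same elements as the breakdown above:
# start anchor, leading \\, servername, \ separator, sharename, end anchor.
$pattern = '^' + $leading + '\w*' + $separator + '\w*' + '$'

$isUncPath = '\\file01\public' -match $pattern
$notUncPath = 'file01\public' -match $pattern
```

The assembled $pattern is identical to the hand-written ^\\\\\w*\\\w*$, so $isUncPath is True, and $notUncPath is False because the leading \\ is missing.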

Here's a regular expression pattern to match an IP address:

 PS C:\> "192.168.100.2" -match "^\d+\.\d+\.\d+\.\d+$"
 True
 PS C:\> $matches
 Name                          Value
 ----                          -----
 0                             192.168.100.2
 PS C:\> "192.168.100" -match "^\d+\.\d+\.\d+\.\d+$"
 False
 PS C:\>

This should begin to look familiar. We're matching digits and using the \ character to escape the period character since the period is a special regular expression character. By using the beginning and end of expression characters, we also know that we'll only get a successful match on a string with 4 numbers that are separated by periods. Of course there's more to an IP address than four numbers. Each set of numbers can't be greater than three digits long. Here's how we can construct a regular expression to validate that:

 PS C:\> "192.168.100.2" -match "^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$"
 True
 PS C:\> "172.16.1.2543" -match "^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$"
 False
 PS C:\>

The first example matches because each dotted octet is between 1 and 3 digits. The second example fails because the last octet is 4 digits.
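The digit-count check still accepts values such as 999.1.1.1, because \d{1,3} says nothing about the numeric range of each octet. One possible way to tighten this, sketched here under the assumption that a valid octet is any value from 0 to 255, is to capture each octet in parentheses and compare it as a number. The function name Test-IPv4Address is hypothetical.

```powershell
# Hypothetical helper: checks the dotted format with a regular expression,
# then checks that each captured octet is no greater than 255.
function Test-IPv4Address {
    param([string]$Address)
    # Parentheses capture each octet into $matches[1] through $matches[4].
    $pattern = '^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$'
    if (-not ($Address -match $pattern)) {
        return $false
    }
    foreach ($i in 1..4) {
        # \d{1,3} guarantees one to three digits, so the cast to [int] is safe.
        if ([int]$matches[$i] -gt 255) {
            return $false
        }
    }
    return $true
}

$valid       = Test-IPv4Address "192.168.100.2"
$tooLong     = Test-IPv4Address "172.16.1.2543"
$outOfRange  = Test-IPv4Address "999.1.1.1"
```

Here $valid is True, while $tooLong fails the pattern and $outOfRange fails the numeric check. Splitting the work this way keeps the regular expression readable instead of encoding the 0 to 255 range in the pattern itself.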