Data Patterns in Custom Sensitive Data Types

You define the data pattern for a custom data type using a simple set of regular expressions comprised of the following:

  • three metacharacters

  • escaped characters that allow you to use the metacharacters as literal characters

  • six character classes

Metacharacters are literal characters that have special meaning within regular expressions.

Sensitive Data Pattern Metacharacters

Metacharacter

Description

Example

?

Matches zero or one occurrence of the preceding character or escape sequence; that is, the preceding character or escape sequence is optional.

colou?r matches color or colour

{n}

Matches the preceding character or escape sequence n times.

For example,
\d{2} matches 55, 12, and so on;
\l{3} matches AbC, www, and so on;
\w{3} matches a1B, 25C, and so on; 
x{5} matches xxxxx

\

Allows you to use metacharacters as actual characters and is also used to specify a predefined character class.

\? matches a question mark,
\\ matches a backslash, 
\d matches numeric characters, and so on

You must use a backslash to escape certain characters for the sensitive data preprocessor to interpret them correctly as literal characters.

Escaped Sensitive Data Pattern Characters

Use this escaped character...

To represent this literal character...

\?

?

\{

{

\}

}

\\

\

When defining a custom sensitive data pattern, you can use character classes.

Sensitive Data Pattern Character Classes

Character Class

Description

Character Class Definition

\d

Matches any numeric ASCII character 0-9

0-9

\D

Matches any byte that is not a numeric ASCII character

not 0-9

\l (lowercase “ell”)

Matches any ASCII letter

a-zA-Z

\L

Matches any byte that is not an ASCII letter

not a-zA-Z

\w

Matches any ASCII alphanumeric character

Note that, unlike PCRE regular expressions, this does not include an underscore (_).

a-zA-Z0-9

\W

Matches any byte that is not an ASCII alphanumeric character

not a-zA-Z0-9

The preprocessor treats characters entered directly, instead of as part of a regular expression, as literal characters. For example, the data pattern 1234 matches 1234.

The following data pattern example, which is used in system-provided sensitive data rule 138:4, uses the escaped digits character class, the multiplier and option-specifier metacharacters, and the literal dash (-) and left and right parentheses () characters to detect U.S. phone numbers:


(\d{3}) ?\d{3}-\d{4}

Exercise caution when creating custom data patterns. Consider the following alternative data pattern for detecting phone numbers which, although using valid syntax, could cause many false positives:


(?\d{3})? ?\d{3}-?\d{4}

Because the second example combines optional parentheses, optional spaces, and optional dashes, it would detect, among others, phone numbers in the following desirable patterns:

  • (555)123-4567

  • 555123-4567

  • 5551234567

However, the second example pattern would also detect, among others, the following potentially invalid patterns, resulting in false positives:

  • (555 1234567

  • 555)123-4567

  • 555) 123-4567

Consider finally, for illustration purposes only, an extreme example in which you create a data pattern that detects the lowercase letter a using a low event threshold in all destination traffic on a small company network. Such a data pattern could overwhelm your system with literally millions of events in only a few minutes.