How to detect keywords or phrases in the body content of messages

Background

Email messages can contain two sections with text data that get rendered in an email client when a message is viewed by a user:

  • text/html (or just html) section
  • text/plain (or just plain) section

As the section names imply, the html section can contain HTML mark-up such as hyperlinks, embedded images, text formatting, and more.

The plain section is displayed as-is, and any HTML inside it is not rendered. Email clients typically display the html section by default when it is present and fall back to the plain section only when it is not. A message is not required to include both sections.
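
To make the two sections concrete, here is a minimal Python sketch using the standard library's email package. It is an illustration only, not part of MQL: the helper name body_sections and the sample message are made up for this example.

```python
from email.message import EmailMessage

# Build a sample message that carries both sections (multipart/alternative).
msg = EmailMessage()
msg["Subject"] = "Example"
msg.set_content("Hello from the plain section.")  # text/plain
msg.add_alternative("<p>Hello from the <b>html</b> section.</p>", subtype="html")  # text/html

# A second message with only a plain section, since both are optional.
plain_only = EmailMessage()
plain_only.set_content("Plain text only.")


def body_sections(message):
    """Return {'plain': ..., 'html': ...} for whichever sections are present."""
    sections = {}
    for part in message.walk():
        ctype = part.get_content_type()
        if ctype in ("text/plain", "text/html") and not part.is_attachment():
            # Keep the first section of each type.
            sections.setdefault(ctype.split("/")[1], part.get_content())
    return sections


sections = body_sections(msg)
plain_only_sections = body_sections(plain_only)
```

For the first message, sections holds both a "plain" and an "html" entry; for the second, only "plain" is present.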

HTML content in MQL

For HTML bodies, content is stored in two fields:

  • body.html.raw
  • body.html.inner_text

The original HTML body is preserved in the raw field, and the decoded inner text is stored in inner_text. Unless you need to match specific HTML elements, prefer body.html.inner_text.

Example HTML body

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml"
      xmlns:o="urn:schemas-microsoft-com:office:office">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    <style type="text/css" media="all">/* a lot of CSS, totally ignored! */</style>
</head>
<body width="100%">
<span style="color:transparent;visibility:hidden;display:none;opacity:0;height:0;width:0;font-size:0;">This &amp; that are in a hidden span.</span><img
        src="https://test.local/img.png"> <!-- Here's a commentInsert &zwnj;&nbsp; hack after hidden preview text -->
</div>
<p>Some paragraph content, before a table</p>
<table>
<tr>
    <td>Row 1, Column 1. <span>Span contents inside R1C1</span></td>
    <td>Row 1, Column 2</td>
</tr>
<tr>
    <td>Row 2, Column 1</td>
    <td>Row 2, Column 2</td>
</tr>
</table>

<!-- comment before an image link -->
<a href="https://test.local"><img src="https://test.local/img.png"></a>
<div>Copyright &copy; 2022</div>
</body>
</html>

The inner text from the parsed HTML is much more compact. Note that the text of each tag is placed on its own line, regardless of whether the tags display on the same line visually.

This & that are in a hidden span.
Some paragraph content, before a table
Row 1, Column 1
Span contents inside R1C1
Row 1, Column 2
Row 2, Column 1
Row 2, Column 2
Copyright © 2022
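
The extraction shown above can be approximated with Python's built-in html.parser. This is a rough sketch under stated assumptions, not the parser MQL uses: the class InnerTextParser, the function inner_text, and the shortened sample HTML are hypothetical, and the real inner_text field may differ in details such as whitespace handling.

```python
from html.parser import HTMLParser


class InnerTextParser(HTMLParser):
    """Collect text nodes one per line, skipping <style> and <script> contents."""

    SKIPPED = {"style", "script"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; and &copy;.
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)

    # Comments are ignored because handle_comment is not overridden.


def inner_text(html):
    parser = InnerTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<style>p { color: red }</style>"
    "<span style='display:none'>This &amp; that</span><!-- a comment -->"
    "<table><tr><td>Row 1. <span>Inside R1C1</span></td>"
    "<td>Row 1, Column 2</td></tr></table>"
    "<div>Copyright &copy; 2022</div>"
)
extracted = inner_text(sample)
print(extracted)
```

As in the example above, CSS and comments are dropped, entities are decoded, and text from a visually hidden span is still included, because the sketch looks only at the markup and never at how it would render.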

When searching inside HTML contents

Because raw HTML can be large, searching inside body.html.raw can be slow. For better performance, write rules against body.html.inner_text instead. This parsed field contains only the unescaped text inside the HTML, with the text of different tags on separate lines, so it is much smaller and significantly faster to search.

Plain content in MQL

For plain bodies, content is stored in body.plain.raw.

Detect specific keywords

We can search both text sections easily to detect specific keywords or phrases:

any([body.plain.raw, body.html.inner_text], ilike(., "*voicemail*", "*password reset*"))
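
To illustrate the logic of the rule above outside MQL, here is a small Python sketch. It assumes that ilike means a case-insensitive wildcard match in which * stands for any run of characters. The functions below are hypothetical stand-ins written for this example, not the MQL implementation.

```python
from fnmatch import fnmatchcase


def ilike(text, *patterns):
    """True if text matches any pattern, ignoring case.

    '*' matches any run of characters, including newlines. Note that fnmatch
    also treats '?' and '[...]' as special, which MQL may not.
    """
    lowered = text.lower()
    return any(fnmatchcase(lowered, pattern.lower()) for pattern in patterns)


def any_field_matches(fields, *patterns):
    # This sketch skips missing sections (None), since a message may lack
    # an html or plain section.
    return any(ilike(field, *patterns) for field in fields if field is not None)


plain_raw = None  # this sample message has no plain section
html_inner_text = "You have a new Voicemail\nClick to listen"

hit = any_field_matches(
    [plain_raw, html_inner_text], "*voicemail*", "*password reset*"
)
```

Here hit is True because the html text contains "Voicemail", and the match succeeds even though the keyword is capitalized differently in the pattern.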

Detect complex phrases

We can use regular expressions if we're looking for something more complex, like a Social Security Number:

any([body.plain.raw, body.html.inner_text], regex_search(., '\b(\d\d\d)-(\d\d)-(\d\d\d\d)\b'))
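
Before deploying a pattern like this, it can help to try it against sample strings. The sketch below does so with Python's re module, which is an assumption of convenience: the regex engine behind MQL may differ from Python's in some details, and the sample strings are invented.

```python
import re

# Same pattern as in the rule above: 3 digits, 2 digits, 4 digits,
# with word boundaries on both ends.
SSN_PATTERN = re.compile(r"\b(\d\d\d)-(\d\d)-(\d\d\d\d)\b")

# Sample text mapped to whether a match is expected.
samples = {
    "My SSN is 123-45-6789.": True,
    "Order 1234-56-7890 shipped": False,  # four leading digits
    "Call 555-12-34567 today": False,     # five trailing digits
    "No digits here": False,
}

results = {text: bool(SSN_PATTERN.search(text)) for text in samples}
groups = SSN_PATTERN.search("My SSN is 123-45-6789.").groups()
```

The word boundaries (\b) are what keep the pattern from matching inside longer digit runs such as order or phone numbers.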

Example detection rules

View example rules for body.html and body.plain.