Tokens

This article is written for more advanced, technical users. It explains what NOW Privacy does when it ingests your data. Understanding this process will help you use search more effectively.

Advanced search offers three search types:

  • Matches any of the Words

  • Matches the Phrase

  • Matches a Wildcard *

Let’s look at these.

Matches any of the Words and Matches the Phrase

When we index a document, we break it down in several ways.

Imagine that the text in a document is:

The cat ran in circles 10 times as Stephen entered the room

The main search index performs a lot of normalisation and processing that you don’t see. For this text, the index would actually store:

cat, run, circle, time, stephen, enter, room

Each of these is known as a token.

That’s all that gets indexed for the body.

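Notice that NOW Privacy has lower-cased ‘Stephen’, dropped common words such as ‘The’, ‘in’ and ‘as’, dropped the numeral ‘10’, and normalised ‘ran’ to ‘run’.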

When a user enters a search phrase (such as ‘Cat runs around’) and uses a Matches any of the Words or Matches the Phrase search, we apply exactly the same tokenisation process, giving:

cat, run, around

This time NOW Privacy has normalised ‘runs’ to ‘run’. ‘Run’ is the infinitive of the verb, and NOW Privacy uses the infinitive as the normalised form.
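The tokenisation steps described above can be sketched in Python. This is a minimal illustration, not NOW Privacy’s actual implementation: the function name tokenise, the stop-word list and the base-form table are all assumptions, chosen only so that the two examples in this section come out as shown.

```python
import string

# Hypothetical stop words and base-form table: assumptions made for this
# sketch, covering only the words used in the examples in this section.
STOP_WORDS = {"the", "in", "as"}
BASE_FORMS = {
    "ran": "run",
    "runs": "run",
    "circles": "circle",
    "times": "time",
    "entered": "enter",
}


def tokenise(text):
    """Return normalised tokens for a document body or a search phrase."""
    tokens = []
    for raw in text.split():
        word = raw.strip(string.punctuation).lower()
        # Skip numerals, alphanumerics and anything else not purely alphabetic.
        if not word.isalpha():
            continue
        if word in STOP_WORDS:
            continue
        tokens.append(BASE_FORMS.get(word, word))
    return tokens


document_tokens = tokenise(
    "The cat ran in circles 10 times as Stephen entered the room"
)
query_tokens = tokenise("Cat runs around")
print(document_tokens)
print(query_tokens)
```

The point of the sketch is that one function handles both the document and the search phrase, so both sides end up in the same normalised form before they are compared.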

Since both ‘cat’ and ‘run’ match, this document is likely to appear near the top of the results.

NOW Privacy is able to search quickly because it only has to compare tokens for exact matches. If you were to skip the query processing and search the index for the values Cat, cats, CATS (or any other variant of casing, conjugation or pluralisation), you’d get zero results, because the index contains only the normalised token ‘cat’.
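A small inverted index illustrates why exact matching is cheap and why un-normalised values find nothing. This is a sketch under assumptions: the dictionary layout, the document ID doc-1 and the function name lookup are hypothetical and do not describe how NOW Privacy stores its index.

```python
# Hypothetical inverted index: each normalised token maps to the set of
# documents that contain it. "doc-1" stands for the example document.
index = {
    token: {"doc-1"}
    for token in ["cat", "run", "circle", "time", "stephen", "enter", "room"]
}


def lookup(token):
    """Exact, case-sensitive lookup of one token; no normalisation applied."""
    return index.get(token, set())


# Tokens produced from the search phrase 'Cat runs around'.
query_tokens = ["cat", "run", "around"]
matched = [t for t in query_tokens if lookup(t)]

# Raw, un-normalised values are not in the index at all.
raw_hits = [v for v in ["Cat", "cats", "CATS"] if lookup(v)]

print(matched)
print(raw_hits)
```

Each lookup is a single dictionary access, with no scanning of document text, which is the kind of exact comparison the paragraph above refers to.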

Matches a Wildcard *

The approach we just described is fast and reliable, but it has some drawbacks. For example, it strips out all punctuation, numerals and alphanumeric strings, and it loses the precise phrasing and syntax that appeared in the original document.

But what happens if, for example, the punctuation is essential? Think of technical design documents that specify parts by a precisely formatted part number such as:

B201-200.6

It isn’t possible to search for that using NOW Privacy’s standard index and the techniques we described for Matches any of the Words and Matches the Phrase, because the standard tokenisation discards it.

So NOW Privacy has a secondary copy of the data that we create when we perform the indexing. That secondary index has an entirely different tokenisation rule:

Don’t separate on whitespace (or any other character), don’t normalise the case, and treat the entire document as a single token.

So the wildcard index for the above raw document would have one token:

The cat ran in circles 10 times as Stephen entered the room

When you execute a wildcard query, it’s executed against that single token.

So all of these would match:

The*, The cat*, *10*, *Stephen*, *room

because:

  • ‘The’ followed by any characters matches

  • ‘The cat’ followed by any characters matches

  • ‘10’ preceded and followed by any characters matches

  • ‘Stephen’ preceded and followed by any characters matches

  • ‘room’ preceded by any characters matches

And these would not match:

Stephen*, *stephen*, The Cat*, the cat*, The

because:

  • ‘Stephen’ followed by any characters doesn’t match - our token starts with ‘The cat’

  • ‘stephen’ preceded and followed by any characters doesn’t match - our token includes ‘Stephen’ but not ‘stephen’

  • ‘The Cat’ followed by any characters doesn’t match - our token includes ‘cat’ but not ‘Cat’

  • ‘the cat’ followed by any characters doesn’t match - our token begins with ‘The’, not ‘the’

  • ‘The’ doesn’t match - with no wildcard, the query has to equal the entire token, and our token is much longer than just three characters!
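The matching rules above can be reproduced with Python’s standard fnmatch module, whose fnmatchcase function is case-sensitive and compares the pattern against the whole string. This is only an analogy and not NOW Privacy’s wildcard engine: fnmatch also gives special meaning to ? and [...], which this article does not cover. The function name wildcard_matches and the part-number document text are hypothetical.

```python
from fnmatch import fnmatchcase

# The wildcard index holds the whole document as one unmodified token.
token = "The cat ran in circles 10 times as Stephen entered the room"


def wildcard_matches(pattern, text=token):
    """Case-sensitive match of a * pattern against the entire token."""
    return fnmatchcase(text, pattern)


should_match = ["The*", "The cat*", "*10*", "*Stephen*", "*room"]
should_not_match = ["Stephen*", "*stephen*", "The Cat*", "the cat*", "The"]

for pattern in should_match + should_not_match:
    print(f"{pattern!r}: {wildcard_matches(pattern)}")

# Hypothetical document text containing the part number from this section.
part_doc = "Fit bracket B201-200.6 to the frame"
part_found = wildcard_matches("*B201-200.6*", part_doc)
print(part_found)
```

Because the token keeps its punctuation and casing, a pattern such as *B201-200.6* can find a part number that the standard index would have discarded.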