In summary, the most general tokenizing process involves first identifying the document structure and then identifying words in the text as any sequence of alphanumeric characters terminated by a space or special character, with everything converted to lowercase.
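As a minimal sketch of this rule (not a definitive implementation, and assuming the document structure, such as markup, has already been identified and set aside), the word-identification step can be expressed with a single regular expression: every maximal run of alphanumeric characters becomes a token and is lowercased. The function name "tokenize" below is purely illustrative.

    import re

    def tokenize(text):
        """Split text into lowercase tokens of alphanumeric characters.

        Any non-alphanumeric character (a space or other special
        character) terminates the current token; the terminators
        themselves are discarded.
        """
        return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]

    # Example usage (structure such as HTML tags would be handled first):
    print(tokenize("Tokenizing is simple, in principle!"))
    # ['tokenizing', 'is', 'simple', 'in', 'principle']

Note that this simple rule throws away punctuation and case information, which is exactly the behavior described above; later refinements (handling apostrophes, hyphens, acronyms, and so on) would modify the regular expression or add further processing steps.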