Sanitizer#
A sanitizer’s job is to take in some HTML, do some magic, and spit out HTML that is supposed to be safe.
If we look at what it actually does, it is quite simple:
- First, it parses the HTML into a data object (typically a DOM tree)
- Then, it works its magic and sanitizes that object
- Finally, it serializes the object back into HTML and feeds it to the browser
- The browser then simply reparses that HTML and renders it
So, TLDR: parse -> sanitize -> serialize -> reparse by browser -> render
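To make that pipeline concrete, here is a minimal sketch of the first three steps using only Python's standard library. It is a toy under stated assumptions, not how DOMPurify or any real sanitizer works: the allowlists, the `Node` class, and the `sanitize_html` helper are all made up for illustration, and dropping a disallowed element together with its whole subtree is just one possible policy.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlists for this sketch; real sanitizers use much larger, reviewed lists.
ALLOWED_TAGS = {"p", "b", "i"}
ALLOWED_ATTRS = {"class"}
# Elements that never have children, so they must not be pushed on the open-element stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A minimal element: tag name, attributes, and children (Node objects or text)."""
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []


class TreeBuilder(HTMLParser):
    """Step 1: parse the HTML into a simple tree."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def sanitize(node):
    """Step 2: drop disallowed elements (with their subtree) and disallowed attributes."""
    kept = []
    for child in node.children:
        if isinstance(child, str):
            kept.append(child)
        elif child.tag in ALLOWED_TAGS:
            child.attrs = {k: v for k, v in child.attrs.items() if k in ALLOWED_ATTRS}
            sanitize(child)
            kept.append(child)
    node.children = kept


def serialize(node):
    """Step 3: turn the tree back into an HTML string, escaping text and attribute values."""
    out = []
    for child in node.children:
        if isinstance(child, str):
            out.append(escape(child, quote=False))
        else:
            attrs = "".join(f' {k}="{escape(v or "")}"' for k, v in child.attrs.items())
            out.append(f"<{child.tag}{attrs}>{serialize(child)}</{child.tag}>")
    return "".join(out)


def sanitize_html(dirty):
    builder = TreeBuilder()
    builder.feed(dirty)
    builder.close()  # flush any buffered trailing text
    sanitize(builder.root)
    return serialize(builder.root)


print(sanitize_html('<p onclick="x()">Hello <b>world</b><script>alert(1)</script></p>'))
```

The string this returns is what would be handed to the browser, which then does its own parsing pass on it (the "reparse" step).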
However#
However, that’s a lie. Its job is not as simple as I made it out to be, mainly because HTML is a tolerant language. It was designed with the intention that “one little mistake from a developer should not crash the whole website”.
Because of that, an HTML parser has to try its best to guess at and fix the broken HTML it receives. For example, the parser will fix <p>some text into <p>some text</p>.
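As a rough illustration of that kind of fix-up, here is a small sketch that closes any tags left open. It is only a toy built on Python's `html.parser` (which by itself is a tokenizer and does not repair anything). The `AutoCloser` class and `fix_up` function are hypothetical names, attributes are ignored, and the real error-recovery rules in the HTML standard are far more involved than this.

```python
from html import escape
from html.parser import HTMLParser


class AutoCloser(HTMLParser):
    """Re-emits the HTML it sees while tracking which tags are still open."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.out.append(f"<{tag}>")  # attributes are ignored in this sketch
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag with no matching start tag: ignore it
        # Close everything opened after the matching tag, then the tag itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def fix_up(html):
    parser = AutoCloser()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    # Close whatever is still open, innermost first.
    while parser.open_tags:
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)


print(fix_up("<p>some text"))   # <p>some text</p>
print(fix_up("<b><i>hi</b>"))   # <b><i>hi</i></b>
```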
To make it worse, there are many different HTML parsers, and although there is a standard, each one might do things a little differently from the others. These discrepancies are called parser differentials.
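Here is a tiny sketch of what a differential can look like. Both "parsers" below are toys (the `ToyParser` class and its `p_closes_p` flag are invented for this example), and they differ in exactly one rule: whether a new <p> implicitly ends a <p> that is still open, which is what browsers do. Same input, two different trees.

```python
from html import escape
from html.parser import HTMLParser


class ToyParser(HTMLParser):
    """Serializes what it parsed; `p_closes_p` toggles one browser-like rule."""
    def __init__(self, p_closes_p):
        super().__init__()
        self.p_closes_p = p_closes_p
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Browser-like rule: a new <p> implicitly ends a currently open <p>.
        if self.p_closes_p and tag == "p" and self.open_tags and self.open_tags[-1] == "p":
            self.open_tags.pop()
            self.out.append("</p>")
        self.out.append(f"<{tag}>")  # attributes are ignored in this sketch
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Only honor an end tag that matches the innermost open element.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def result(self):
        self.close()  # flush any buffered trailing text
        closing = [f"</{t}>" for t in reversed(self.open_tags)]
        return "".join(self.out + closing)


def parse_with(p_closes_p, html):
    parser = ToyParser(p_closes_p)
    parser.feed(html)
    return parser.result()


print(parse_with(False, "<p>one<p>two"))  # <p>one<p>two</p></p>  (nested paragraphs)
print(parse_with(True, "<p>one<p>two"))   # <p>one</p><p>two</p>  (sibling paragraphs)
```

If the sanitizer sees the first tree and the browser builds the second, the two no longer agree on what the document looks like, and that disagreement is where things can go wrong.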
On top of that, sanitizing is not native to the browser or the parser. The most widely used library for sanitizing HTML is DOMPurify, which is developed and maintained by a third party.