This article applies to Drupal 5.x.
Processing textual content for output in a browser is one of Drupal's most critical tasks. Without such processing we would all have to become masters at typing in HTML text! In this article I will explain what filters and input formats are, why they are important, how they are used, and why they impact the security of your site.
Filters and Input Formats
The pillars of Drupal's text handling are filters and input formats. A filter is a set of rules that can be applied to transform text in some way. Some filters strip certain HTML tags or security hazards from text. Other filters look for special patterns and expand the text in a meaningful way. Other fun-oriented filters, such as the Pirate Filter, rewrite the text altogether (in this case, to make it "talk like a pirate"). Filters know how to do one thing, and do it well; text in, filtered text out.
Some filters have extra configuration options. The HTML filter, for example, strips all but an allowed set of HTML tags from text. The set of allowed tags can be determined by the administrator.
An input format is an ordered collection of filters. Any text that is being displayed to the browser should be run through the filters in an input format first. The input format then applies all of the filters, in the right order, so that one filter feeds its output to the next, forming a chain. This chaining of filters can be the source of great flexibility as well as great confusion. The flexibility comes from the fact that filters can be made to work together, the confusion comes from the case where filters inadvertently work against each other, one filter undoing the work of the previous filter. I'll show examples of both.
Input versus Output
Drupal captures input in its raw form, saving whatever gets submitted straight to the database without alteration. Then, before displaying any such content in the browser, Drupal processes the text by choosing an input format to apply. Why doesn't Drupal apply the filters in an input format before saving input into the database? The answer is simple; flexibility. If you were to change the text that a user has input before saving it in the database, you could never get back to the original state. You could never change your mind about the configuration of the filters. By filtering on output, not on input, Drupal gives the site administrator the option of changing how content is displayed at any time. As an example, imagine that you notice the users on your site using character patterns to represent smiley faces. I know, that stuff is so 1998 :P But just for fun, let's say they're doing it ;-) You look around and find the Smiley Filter on Drupal.org, and install it. Now all of the keystroke patterns that your users had been using can be displayed as images
This ability to change is only available if the input is saved verbatim and filtering is done on output.
Meet Drupal's Core Filters
Here is a rundown of the filters that Drupal ships with:
-
HTML Filter: The HTML filter is primarily responsible for removing HTML tags from text. It can be configured to allow any number of tags (whitelist) and it will remove the rest. It removes them either by stripping them, or by escaping them into entities like this: <div> If tags are escaped, they show up in the output as visible tags: <div>Some text</div>. The set of tags that are allowed by default include: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd>
The final task of the HTML filter is to add a spam link deterrent to anchor tags. The deterrent, proposed by Google, gives search engines a tip about which links to follow when crawling the web. If this option is enabled, rel="nofollow" will be added as an attribute of all anchor tags.
- Line Break Converter: This filter converts line breaks into <br> or <p> tags depending on whether a single or double line break is found. This preserves the paragraph formatting in the text that is input.
- URL Filter: Any web or email addresses that are found in the text will be converted to clickable links, thus saving the user the hassle of having to type <a href="....">
- PHP Evaluator: The PHP Evaluator is the most radical of all Drupal's core filters. It looks for text enclosed in <?php ... ?> and evaluates it as PHP code. This effectively allows you to program and extend Drupal just by submitting content to the site! In 99% of cases, this is a bad idea, and the initial attraction of harnessing such power should be weighed by a healthy sense of fear. If you really need to write PHP code to accomplish what you're trying to do, writing a module is usually a better idea (and not that hard in most cases). Furthermore, in the wrong hands, the PHP Evaluator is an enormous security risk. A malicious attacker, with the PHP Evaluator at their disposal, could wipe out your database and take control of your web server.
Drupal's Core Input Formats
Drupal also comes with three input formats pre-defined.
- Filtered HTML: This is the workhorse input format that is used most of the time for displaying posts such as blogs, pages, forum topics and so forth. It combines the URL Filter, the HTML Filter and the Line Break Converter in a way that allows users a small set of HTML tags for formatting while taking care of paragraphs and URLs behind the scenes. This is also the default input format for new Drupal installations. More on default input formats later.
- PHP Code: This input format consists of only one filter, the PHP Evaluator filter. This input format is to be used when the goal is embedding PHP code in a post.
- Full HTML: The Full HTML input format applies only the Line Break Converter filter. No HTML tags are stripped and no weblinks are converted to anchor tags.
Order Matters
When an input format consists of more than one filter, the ordering of the filters has a huge impact on what the final output is. The Filtered HTML input format has three filters, the URL Filter, the HTML Filter, and the Line Break Converter. Here is the order in which they are executed in a new Drupal installation:
Assuming the HTML Filter allows the default set of tags (see the list above), let's examine what happens to some HTML text as it gets processed by the Filtered HTML input format. Here's the text:
<h1>The quick brown fox jumps over the lazy dog.</h1>
King Phillip came over from <em>Germany</em> swimming.<br><br>Every good boy deserves fudge.
http://drupal.org
The first filter is the URL filter. It will find the URL which we have in this text and make it into a proper anchor tag:
Before
http://drupal.org
After
<a href="http://drupal.org" title="http://drupal.org">http://drupal.org</a>
The second filter is the HTML Filter. The text contains two HTML tags that are not on the whitelist, namely <h1> and <br>. Thus, they will be stripped.
Before
<h1>The quick brown fox jumps over the lazy dog.</h1>
King Phillip came over from <em>Germany</em> swimming.<br><br>Every good boy deserves fudge.
After
The quick brown fox jumps over the lazy dog.
King Phillip came over from <em>Germany</em> swimming.Every good boy deserves fudge.
Finally, the Line Break Converter gets its chance. It looks for line break characters (\n, \n\n, etc.) and replaces them either with <br /> or encloses blocks of text in <p>...</p> tags. The function responsible for this is quite cunning, and was inspired by code from WordPress (our debt of gratitude).
Before (as received from the HTML Filter)
The quick brown fox jumps over the lazy dog.
King Phillip came over from <em>Germany</em> swimming.Every good boy deserves fudge.
<a href="http://drupal.org" title="http://drupal.org">http://drupal.org</a>
After
<p>The quick brown fox jumps over the lazy dog.<br />
King Phillip came over from <em>Germany</em> swimming.Every good boy deserves fudge.</p>
<p><a href="http://drupal.org" title="http://drupal.org">http://drupal.org</a></p>
So what went wrong here? Well, nothing, technically. But the output is unlikely to be what the user expected. First of all, the <h1> tag was stripped, which is a good thing because it is what the Drupal administrator wanted. Second, the place where the user wanted to make a line break using <br><br> is totally different than what the user might expect. After all, the final rendered HTML contains a <br /> tag, so why were the <br> tags from the user stripped out? And why weren't they replaced by something more intelligent by the Line Break Converter? The answer can be seen by looking at the the text as it gets passed from the HTML Filter to the Line Break Converter. Because <br> isn't on the whitelist of allowed tags, the HTML Filter takes them out, leaving the Line Break Converter no clues to follow concerning the user's wish for a line break between "swimming." and "Every good boy".
Changing the Order
Let's look at the example above and see what would happen if we change the order of the filters. This is done by clicking to Administer -> Site configuration -> Input formats -> (Filtered HTML) configure -> Rearrange. Here I've changed the order so that the HTML Filter comes after the Line Break Converter.
As in the first example, the first filter is the URL Filter, so the output from that will be the same. The second filter, though, is now the Line Break Converter. Here is what happens to our text coming from the URL Filter and going into the Line Break Converter:
Before (as received from the URL Filter)
<h1>The quick brown fox jumps over the lazy dog.</h1>
King Phillip came over from <em>Germany</em> swimming.<br><br>Every good boy deserves fudge.
<a href="http://drupal.org" title="http://drupal.org">http://drupal.org</a>
After
<h1>The quick brown fox jumps over the lazy dog.</h1>
<p>King Phillip came over from <em>Germany</em> swimming.<br><br>Every good boy deserves fudge.</p>
<p><a href="http://drupal.org" title="http://drupal.org">http://drupal.org</a></p>
Interesting to note is that no paragraph tag was placed on "The quick brown fox". This is because <h1> elements are block level elements (which get their own line breaks in rendered HTML), so the Line Break Converter ignores them. Also interesting is that we have many instances of <p> and <br> tags, even though we're about to go into the HTML Filter which is configured to strip those tags out. Here is the final output with the new ordering of the filters (line breaks added for readability):
The quick brown fox jumps over the lazy dog. King Phillip came over
from <em>Germany</em> swimming.Every good boy deserves fudge.
<a href="http://drupal.org" title="http://drupal.org">http://drupal.org</a>
What a mess =)
What's the real solution in this case? Well, the original order of filters worked better, so consider leaving it URL Filter -> HTML Filter -> Line Break Converter. One way to fix the problem would be to configure the HTML Filter to allow <br> and <p> tags. This gives HTML savvy users control over paragraph formatting. The other solution would be to submit a patch against the HTML filter (see filter.module) to have it replace <br> tags with \n line break characters so that the Line Break Converter will pick up on them. Guess which solution is easier :P
Input Formats and User Roles
Not all Drupal users are created equal. Some are anonymous users, some are authenticated users, and some have other user roles that allow them to have greater privileges than normal authenticated users. Furthermore, one Drupal user on every site is the super-user (user #1) who can do anything. The privilege of using an input format can be assigned to users on a per-role basis. This is an important mechanism that exists for allowing some trusted users to have access to some filters while denying this access to less trusted users.
Look at the screen Administer -> Site configuration -> Input formats. It lists all of the input formats that have been established. The default Drupal installation comes with three, as noted above. On any Drupal site, one input format has to be designated as the default input format. This is indicated by the radio button in the Default column. To guarantee the presence of at least one input format, the default format cannot be deleted. The others can be, however, and you might consider deleting any input formats (such as the PHP code format) that you don't plan on using.
On the configuration screen for an input format, you will see a listing of all the roles for users on your site. For any input format besides the default you can specify which user roles are privileged to use that input format. The default input format is automatically available to all users in all roles on your site and this cannot be changed.
Filters and Security
Filters and security go hand-in-hand. Without filters, there would be no security for your site as malicious attackers would have free reign in using scripts to deface your site, subject your users to phishing scams, and steal important data such as passwords.
The heart of the security offered by filters comes from from the HTML Filter and the calls it makes to filter_xss and check_plain. These are the functions that Drupal uses to prevent attacks based on user input. For this reason, all of your user submitted output should be run through the HTML Filter. It is tempting to ignore this advice, especially if you are having troubles getting the configuration settings just right for your purposes. Don't ignore this advice. You may end up sorry.
Also worth reiterating is the fact that the PHP Evaluator filter poses an extreme risk if it can be used by anyone but highly trusted, PHP-competent site administrators. Most sites will be better off deleting the PHP code input format and not extending use of the PHP Evaluator filter to anyone.
Finally, it should be obvious that the Full HTML input format, which does not use the HTML Filter, is insecure and should be offered only to those users who can be trusted not to ruin your site. Most sites will be better off deleting this input format.
Many More Filters Available
The fun with filters is that modules can offer their own filters. The number of filters available is large, and I can't possibly cover them all, but you can get a feel for the possibilities by looking at the Filters and Editors category of Drupal modules on Drupal.org. Here are some interesting modules that offer filters:
ModuleDescriptionAmazon FilterProvides a text filter to insert amazon book title/links, cover images, and themable formatted information using a simple [amazon {title|cover|info} ] tag.BBCode Allows users to specify markup using BBCode.Code FilterRenders syntax-highlighted PHP code. This module is used on Drupal.org.DruTex A LaTex renderer that can, among other things, render mathematical formulas and generate PDFs of nodes.HTML CorrectorCorrects corrupt HTML in nodes and comments. This is useful for cases where users forget to close tags, or for where the teaser view breaks the HTML.Inline FilterUses a [inline:filename.jpg] syntax to allow for inline images or file links.Markdown with SmartyPantsOne of my favorites, this allows simple ASCII formatting to be turned into HTML. For example, ##This would be a h2, and *this would be emphasized*. This module is in use on http://groups.drupal.org.Paging FilterBreak long pages into smaller ones by means of a "page" tag.Pirate FilterTurns English into Pirate speak.Smileys Parses smiley character combinations and replaces them with inline smiley images.Word FilterFilters a list of restricted words.
Conclusion
The filtering of output is an essential part of web publishing and one of Drupal's great strengths. Understanding the difference between input formats and filters, and how to configure each, is an essential step in becoming a great Drupal site administrator. Drupal modules can implement filters to make your site powerful and fun.