
5.6. Cross-Site Scripting (XSS)

Cross-site scripting, or XSS, is the name given to attempts to attack a site by submitting data that will then be displayed back to other users with undesirable effects. This covers everything from messing with stylesheets to capturing users' password inputs. XSS holes are places in an application's implementation that allow user data to be treated as untainted when either the data was never untainted at all or the untainting process was ineffective.

XSS holes in public applications have received more press in the last couple of years than previously, as the techniques to exploit such holes have become more advanced. In October 2005 we saw the first large-scale XSS worm, which attacked the MySpace.com social network. In around 20 hours, the worm spread to infect over one million accounts. Shortly afterward, MySpace.com was taken down to deal with the issue. While the worm was benign (it added the text "Samy Is My Hero" to account profiles), it could have easily been malicious, gathering user credentials or private information.

XSS is a hot topic and the attention that it's started to receive will ensure that there are hundreds more people out there adding it to their arsenal of exploits. As attention grows, the likelihood of any holes in your application being found and exploited also grows.

5.6.1. The Canonical Hole

Even if you're not planning on displaying user-entered HTML data in your application, you can easily fall prey to the most common HTML-based XSS hole. Often within an application you'll be passing data around between pages in the form of HTTP GET or POST parameters. In this example, a user accesses a page, passing along an ID in the GET query string, which is then stashed into another link on the page to let the user navigate somewhere else:

http://myapp.com/profile.php?id=11
...
<p>Take a look at <a href="/photos.php?id=11">Cal's Photos</a>.</p>
...

The PHP and Smarty sources for this page template might look like this:

<p>Take a look at <a href="/photos.php?id=<?=$_GET['id']?>"><?=$user['username']?>'s
Photos</a>.</p>
<p>Take a look at <a href="/photos.php?id={$smarty.get.id}">{$user.username}'s
Photos</a>.</p>

With the code as is, all anybody has to do to inject HTML into your pages is pass it along in the query string:

http://myapp.com/profile.php?id="><script>alert('hello');</script><

When we then take the ID value and write it into the page, we're unwittingly writing user-entered HTML into our pages. This situation is easy to avoid and is pretty unforgivable in professional applications. All we need to do is escape tainted data appropriately until it's filtered. Unlike Perl, PHP doesn't have a taint mode built in, so we need to manage tainting ourselves. A good rule of thumb is to declare certain variables as tainted (usually the family of superglobals, $_GET, $_POST, etc.) and always escape values from those. The data from the tainted variables must always be filtered before we can use it. This tends to create a policy of not trusting any data, which is a good mindset to have. We'll be looking further into why this is important at the end of the chapter, in the "SQL Injection Attacks" section.
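One way to encourage that mindset is to funnel reads of tainted variables through a helper that escapes by default. A minimal sketch, assuming HTML output (get_html_param() is our own invention here, using PHP's HtmlSpecialChars(), which we'll meet properly in a moment):

function get_html_param($name, $default = ''){
        # read a GET parameter, escaping it for safe HTML output
        if (!isset($_GET[$name])){
                return $default;
        }
        return HtmlSpecialChars($_GET[$name], ENT_QUOTES);
}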

It's important to also note that it's not just the usual suspects ($_GET, $_POST, $_COOKIE) that should be considered user entered and thus tainted. The $_SERVER and $_ENV superglobals don't actually contain data purely from the server and its environment. For instance, $_SERVER['HTTP_HOST'] and $_SERVER['REQUEST_URI'] both come directly from the client's request. The following PHP code idiom is actually vulnerable to attack:

<form action="<?=$_SERVER['PHP_SELF']?>">

The value for $_SERVER['PHP_SELF'] comes from the client request for the page, so we know what it's going to be, right? Of course not; that would be too easy. Say our example script is in a file called foo.php. The user can then inject HTML by making this request:

http://myapp.com/foo.php/"><script>alert('hello');</script><

Ouch. It's important to treat all data from external sources as tainted. Only data you've untainted yourself is safe.
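The fix is the same as for any other tainted value: escape it on output. A minimal sketch:

<form action="<?=HtmlSpecialChars($_SERVER['PHP_SELF'], ENT_QUOTES)?>">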

Escaping data to be displayed in HTML source is trivial; we just need to replace the four XML metacharacters (<, >, &, ") with their entities (&lt;, &gt;, &amp;, &quot;). In PHP, this facility is provided by the HtmlSpecialChars() function, and in Smarty by the escape modifier. Our source code thus becomes:

<p>Take a look at <a href="/photos.php?id=<?=HtmlSpecialChars($_GET['id'])?>"><?=HtmlSpecialChars($user['username'])?>'s
Photos</a>.</p>
<p>Take a look at <a href="/photos.php?id={$smarty.get.id|escape}">{$user.username|escape}'s
Photos</a>.</p>

It's not only visible HTML source that needs to be written carefully. The most commonly seen hole is in the escaping of hidden form parameters. Consider this snippet of code from the same page:

<input type="hidden" name="id" value="<?=$_GET['id']?>" />

The ID entered in the URL query string will still show up in the source and cause the same hole. Any user-entered data that you write out (you must always consider any GET or POST variables as user entered) is vulnerable to HTML injection attacks. Anything you haven't pulled out of a database or explicitly filtered must be escaped when outputting HTML. In a situation where you've pulled data from a database, you need to identify which values can possibly still be tainted. A numeric user ID from a database can't be tainted if the database doesn't store anything but numbers, but a username field in the same table could be tainted.
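As before, the fix is simply to escape the tainted value as we output it:

<input type="hidden" name="id" value="<?=HtmlSpecialChars($_GET['id'])?>" />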

With thousands of input vectors, these kinds of problems can eventually sneak into large applications. In addition to strictly escaping all your output, you can write a simple vulnerability scanner to catch mistakes. A simple vulnerability scanner can just crawl between pages in your application, injecting attack data in any cookies, GET and POST variables, HTTP headers (such as spoofed Host headers), and URLs it finds. If after sending the attack data it gets back a page with the same data unescaped, then there's a vulnerability in the application. A scanner like this, however, is no replacement for careful programming. It's extremely important that each member of your engineering team who deals with tainted data has a full understanding of the risks involved and how to avoid exposing users to unfiltered user-entered data.
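A minimal sketch of the reflection test at the heart of such a scanner (not the crawler part; the URL and parameter names here are placeholders):

function probe_url($url, $params){
        # append a unique marker to each parameter and see
        # whether it comes back in the page unescaped
        $marker = '"><xss-probe-'.rand(10000, 99999).'>';
        $pairs = array();
        foreach ($params as $name => $value){
                $pairs[] = urlencode($name).'='.urlencode($value.$marker);
        }
        $page = file_get_contents($url.'?'.implode('&', $pairs));
        # if the raw marker survived, the page reflected our
        # input without escaping it
        return (strpos($page, $marker) !== false);
}

if (probe_url('http://myapp.com/profile.php', array('id' => '11'))){
        echo "possible XSS hole\n";
}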

5.6.2. User Input Holes

In this section we'll look at potential XSS holes and develop a filtering function to eliminate them. We'll start with a naive HTML input parser and build on it each time we find a new flaw.

While it may at first seem like we're reinventing the wheel here, and that XML parsers already do a good job of handling well-formed XML, the need for something beyond a parser is clear. If user input contains badly formed XML, we don't want to throw it away or simply strip out tags. By looking at each tag or tag-like construct, we can dig meaning out of even the worst heap of bad HTML.


The most basic function is going to take a list of allowed tags and parse out everything else. In PHP, we can do this in a few lines:

function filter_html($input){
        # match anything that looks like a tag and run it
        # through our whitelist check
        return preg_replace_callback('!</?([a-z0-9]+)[^<>]*>!i', 'filter_html_tag',
        $input);
}
function filter_html_tag($matches){
        # $matches[0] is the whole tag, $matches[1] the tag name
        $allowed = array('a', 'b', 'img');
        if (in_array(StrToLower($matches[1]), $allowed)){
                return $matches[0];
        }
        return '';
}
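Feeding it a mixed input shows the behavior; disallowed tags are stripped, but their contents remain:

echo filter_html('hello <b>world</b> <script>alert(1)</script>');
# outputs: hello <b>world</b> alert(1)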

5.6.3. Tag and Bracket Balancing

Our function here matches every tag, checks its tag name, and then either allows it through verbatim or removes it completely. This is clearly flawed; it doesn't even deal with attributes. But there's a deeper problem we need to address first. Consider the following inputs:

 <script<script>> 
 <<script>script<script>> 
 <scr<!-- foo -->ipt> 

With any of these inputs, our simple function will allow the evil script tag through. We need to ensure that the input we're processing for tags doesn't contain angle brackets that aren't connected to a tag. The naive solution to this problem is to balance the brackets first, like so:

$data = preg_replace("/>>+/", ">", $data);
$data = preg_replace("/<<+/", "<", $data);
$data = preg_replace("/^>/", "", $data);
$data = preg_replace("/<([^>]*?)(?=<|$)/", "<$1>", $data);
$data = preg_replace("/(^|>)([^<]*?)(?=>)/", "$1<$2", $data);

The first two lines remove repeated sequences, which makes matching other brackets a lot easier. This is typically a user typo in any case. It might be better to convert a sequence such as <<< into &lt;&lt;&lt;, just so the user can see what happened, but that's a slightly more advanced topic. The third expression deals with a closing bracket at the start of the input, since this becomes a tricky case to deal with in our main rules. The final two rules match opening and closing brackets that don't have a corresponding closing or opening bracket. The rules use zero-width positive look-ahead assertions (the ?= syntax), which allow each iteration to match something and still leave it available for the next iteration. This is needed to match groups of unbalanced brackets next to each other; otherwise, we'd need to perform the replacement in a loop, replacing one bad sequence at a time.
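Wrapping the five expressions in a function, we can watch them repair one of the attack strings above (a quick sketch):

function balance_brackets($data){
        $data = preg_replace("/>>+/", ">", $data);
        $data = preg_replace("/<<+/", "<", $data);
        $data = preg_replace("/^>/", "", $data);
        $data = preg_replace("/<([^>]*?)(?=<|$)/", "<$1>", $data);
        $data = preg_replace("/(^|>)([^<]*?)(?=>)/", "$1<$2", $data);
        return $data;
}

echo balance_brackets('<script<script>>');
# outputs: <script><script>
# - both now-complete tags can be removed by the tag filter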

The problem with this approach alone is that it only balances the brackets themselves and allows unbalanced tags to get through. To ensure balanced tags, we need a stack-based balancer that operates as we match tags. The code for that looks a little like this:

$stack = array();
...match tags, calling match_start and match_end for each tag...
function match_start($tag_name){
        global $stack;
        # we're starting a new tag - place it on the stack
        $stack[] = $tag_name;
}
function match_end($tag_name){
        global $stack;
        # we're ending a tag - find it on the stack;
        # we return the HTML to replace the end tag with, if any
        $ret = '';
        while (count($stack)){
                $tag = array_pop($stack);
                $ret .= "</$tag>";
                if ($tag == $tag_name){
                        return $ret;
                }
        }
        # we emptied the stack without finding an opening
        # version of $tag_name, so we discard the end tag
        # itself and return the closing tags we generated
        return $ret;
}

Here we had to make a design decision. When we come across a closing tag that isn't at the top of the stack, should we search down the stack for it until we find it, closing tags on the way, or just close it now and remove it from the top of the stack if it's there (or else leave the stack alone)? A possible compromise is to look down a couple of levels in the stack. If we see the tag being opened, then we close down to that tag. If we don't see that tag being opened near the top of the stack, then we just discard the closing tag.
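One sketch of that compromise, as a variant of the match_end() function above (looking down three levels is an arbitrary choice):

function match_end($tag_name, $depth = 3){
        global $stack;
        # only close down to $tag_name if it was opened
        # recently; otherwise, discard the stray end tag
        if (!in_array($tag_name, array_slice($stack, -$depth))){
                return '';
        }
        $ret = '';
        while (count($stack)){
                $tag = array_pop($stack);
                $ret .= "</$tag>";
                if ($tag == $tag_name){
                        break;
                }
        }
        return $ret;
}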

While this should catch most mistakes and will always create valid markup, it might not always be what the user intended. It's worth playing around a little with some example mistakes to see what you'd expect to happen. To add to the user experience, it can be worth displaying a message to users when data has been dropped from their input, so that they can see what happened. In some applications, assuming the data size isn't too large, it can be useful to store two copies of user data: one unfiltered and one filtered. For all display you can use the filtered copy, but when the users modify the data they will get the unfiltered copy. This allows users to correct their own mistakes and ensures you never lose user-entered data. Deleting a whole paragraph of user text because of a stray bracket can be a pretty bad user experience.

With our balancing logic in place, there are still some common cases we're not handling correctly. In both HTML and XHTML, there are tags that don't require closing and should be self-closed, such as the <img /> and <br /> tags. For these tags, we'll need some special-casing logic: we need to ensure we always self-close them and never enter them onto the stack. If we ever encounter an end tag for one, we can automatically discard it without checking the stack. We also need to ensure that we don't allow other tags to self-close.
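A sketch of how that might hook into the stack machinery above (handle_tag() and its $is_end_tag flag are our own inventions; attribute handling is omitted for brevity):

$self_closing = array('img', 'br', 'hr');

function handle_tag($tag_name, $is_end_tag){
        global $self_closing;
        if (in_array($tag_name, $self_closing)){
                # discard stray end tags for self-closing elements,
                # and always emit them in self-closed form
                return $is_end_tag ? '' : "<$tag_name />";
        }
        if ($is_end_tag){
                return match_end($tag_name);
        }
        match_start($tag_name);
        return "<$tag_name>";
}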

There are still a couple of oddities left to deal with. For the sake of cleanliness, we'll probably want to remove some tags if they don't have any content. For instance, a <b> that opens and closes without any content is redundant. Some tags, such as the <a> tag, can be removed if they don't contain any attributes, or, in the case of the <img /> tag, if it doesn't contain a particular attribute. This is only a matter of style, as these tags pose no threat of XSS vulnerabilities.
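A sketch of one way to sweep up empty tag pairs after balancing, looping so that nested empty pairs (such as <b><i></i></b>) collapse too:

do {
        $data = preg_replace('!<([a-z0-9]+)[^<>]*></\1>!i', '', $data, -1, $count);
} while ($count);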

There can be further trouble in the form of comments and CDATA sections, although this is purely cosmetic. The following is an almost valid comment:

<!-- foo > bar -->

But using our rules, we'll balance up the first and last bracket pairs and end up outputting the word bar. The same can be true of CDATA sections, where brackets are allowed to appear mid-tag. Although again purely cosmetic, we can remove these constructs first to make display cleaner. In the case of CDATA sections, we want to keep the contents but escape them into PCDATA. In the case of comments, we want to remove the whole thing:

$data = preg_replace("/<!--(.*?)-->/s", "", $data);
$data = preg_replace_callback("/<!\[CDATA\[(.*?)\]\]>/s", 'escape_cdata_section', $data);
function escape_cdata_section($matches){
        return HtmlSpecialChars($matches[1]);
}

5.6.4. Protocol Filtering

We mentioned earlier that filtering out undesirable elements and attributes is only a portion of the solution. Another part of our filtering library deals with input that creates partial tags, while another deals with making sure all tags open and close correctly. There's a final piece of the puzzle left: the content inside some of the attributes.

Consider an application with a whitelist that allows hyperlinks and images. The following user input would be allowed:

 <a href="javascript:foo"> 
 <img src="javascript:foo"> 

This is still within the rules of our whitelist but clearly isn't what we want. Simply removing these attributes from our whitelist isn't a great option either; allowing users the ability to input hyperlinks and images is core to many application functions.

The attribute filtering we're going to need to add applies to all attributes that point to an external resource: the href attribute of the a element, the src attribute of the img element, and so on. Each of these is a URL, which means we can nicely break them all down into the following format:

 protocol ":" protocol-dependent-address 

Some examples show that nearly all input falls into this format:

mailto:cal@iamcal.com
http://iamcal.com/
ftp://iamcal.com/
javascript:alert('hello world');
about:blank

The only kind of URL that doesn't fall into this format is the relative URL, which doesn't include a protocol. The decision of whether to allow relative URLs largely depends on how the data will be output. If the data will only be shown on the origin web site, then a relative URL might be acceptable. If it's going to be shown on a remote site, in a local context, or in an email, then relative URLs are not going to work. It's usually fairly trivial to rewrite relative URLs into their absolute form, based on the URL of the input page.
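A rough sketch of such a rewrite (absolutize_url() is our own name for it; it ignores ../ segments and protocol-relative URLs for brevity):

function absolutize_url($url, $base){
        # already absolute? leave it alone
        if (preg_match('!^[a-z][a-z0-9+.-]*:!i', $url)){
                return $url;
        }
        $parts = parse_url($base);
        $root = $parts['scheme'].'://'.$parts['host'];
        if (substr($url, 0, 1) == '/'){
                return $root.$url;
        }
        # resolve against the directory of the base page
        $path = isset($parts['path']) ? $parts['path'] : '/';
        return $root.substr($path, 0, strrpos($path, '/') + 1).$url;
}

echo absolutize_url('photos.php?id=11', 'http://myapp.com/profile.php');
# outputs: http://myapp.com/photos.php?id=11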

We know from our element and attribute filters that whitelists are the way to go, so we can create a minimalist list of known safe protocols:

mailto
http
https
ftp

FTP is perhaps one you'll want to exclude, given that FTP is a fairly specialized service these days; for the same reason, nntp, ssh, sftp, and gopher are probably not on your list. The ones you definitely want to exclude are the dangerous ones like javascript, vbscript, and about.

To filter the protocol, we need to take the attribute contents, split the string at the first colon, and match the left side against our whitelist. It couldn't be easier, right? Of course, there are a couple of problems with this approach. And there's also the issue of relative URLs: if we want to allow them, how do we find the protocol? Let's start with a basic protocol matcher and see what it finds.
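A minimal sketch of the naive check (protocol_allowed() is our own name for it):

function protocol_allowed($url){
        # the naive version: grab everything before the first
        # colon and check it against the whitelist
        $allowed = array('mailto', 'http', 'https', 'ftp');
        $colon = strpos($url, ':');
        if ($colon === false){
                return true;  # no protocol - treat as a relative URL
        }
        return in_array(substr($url, 0, $colon), $allowed);
}

Now consider what it makes of inputs like these: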

<a href="java script:foo">
<a href="java{\t}script:foo">
<a href="java{\n}script:foo">
<a href="java{\0}script:foo">
<a href=" javascript:foo">
<a href="JaVaScRiPt:foo">

It turns out that there's a lot of munging that you have to perform before you have a normalized protocol string to match against. Spaces at the beginning and end have to be stripped, but so do spaces in the middle. All whitespace and formatting characters need to be stripped. Casing needs to be normalized. All of the above examples work in IE6, with several working in Mozilla.
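Putting that munging together, a sketch of a protocol extractor that strips whitespace and control characters and normalizes case before the whitelist check (it doesn't yet deal with the entity-encoding tricks coming up next):

function extract_protocol($url){
        # remove all whitespace and control characters, wherever
        # they appear, then lowercase what's left of the protocol
        $url = preg_replace('/[\s\x00-\x1f]+/', '', $url);
        $colon = strpos($url, ':');
        if ($colon === false){
                return '';  # relative URL - no protocol at all
        }
        return strtolower(substr($url, 0, $colon));
}

We can then allow a URL only when extract_protocol() returns either an empty string or a protocol on our whitelist.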

These formatting tricks aren't necessarily important; your whitelist will exclude them. The exception is where valid protocols are excluded, as with these plausible inputs:

<a href="http:foo">
<a href="http{\n}:foo">
<a href=" http:foo">
<a href="http :foo">
<a href="HTTP:foo">

If our protocol checker just searches for a colon to find the protocol, then it's vulnerable to various techniques all grouped under the heading of protocol hiding. Any data in HTML can be escaped using character entities:

<a href="&#106;&#97;&#118;&#97; &#115;&#99;&#114;&#105;&#112;&#116;&#58;foo">

In fact, most browsers try to be a bit helpful by making minor corrections, such as inserting a missing semicolon. So the following, although invalid, will also work as an attack:

<a href="&#106&#97&#118&#97 &#115&#99&#114&#105&#112&#116&#58;foo">

If the attacker wants to stick to the valid but unexpected, he can prefix some zeros to the numbers. As many as he likes, in fact:

<a href="&#0000106;&#0000097; &#0000118;&#0000097;&#0000115;&#0000099; &#0000114;&#0000105; &#0000112;&#0000116; &#0000058;foo">

And it's not only decimal numbers that are allowed in numeric entities; we can use hex numbers, too, in either case (or mixed case), with or without semicolons and with as many leading zeros as we can muster:

<a href="&#x6A;&#x61;&#x76; &#x61;&#x73;&#x63;&#x72;&#x69;&#x70;&#x74;&#x3A; foo">

As if that weren't enough, inside URLs we also get the luxury of URL encoding, where we use a percentage sign followed by two hex digits (in any case):

<a href="%6A%61%76%61%73%63%72 %69%70%74%3Afoo">

That's a lot of variations. There are more still: with named entities, denormalized UTF-8 sequences, and Unicode character entities. The best thing to do in this situation is to cheat and use somebody else's code. The number of ways in which HTML source filters can be exploited is growing every year, as the browsers add more and more "helper" functions to correct invalid user code. The dangers posed by unfiltered or incorrectly filtered user data grow with each new exploit, as attackers find more and more innovative ways to steal user data. For high-profile applications, protection against XSS can only become more important.

