8.2. Designing the Scanner

Before we start actually building the scanner, we need to define the functional requirements and overall structure of how the scanner should operate.

8.2.1. Functional Requirements

The first thing our scanner will do is obtain data about the target application from which to generate its test requests. Testing routines customized for a specific web application depend on this data. Application spidering, or crawling, is a very effective technique for taking an "inventory" of legitimate application pages and input parameter combinations. You can automatically crawl an application using existing utilities such as Wget, or you can do it manually with the help of a local proxy server such as Odysseus or Burp. Most of the commercial application scanners, such as Sanctum's AppScan and SPI Dynamics' WebInspect, offer users both of these data-collection methods. The goal in either case is to build a collection of sample requests for every application page, which serves as the basis for the list of test requests the scanner will make.

Although the automated technique is obviously faster and easier, it might not discover all application pages, for a variety of reasons. Primarily, the crawl agent must be able to parse HTML forms and generate legitimate form submissions to the application. Many applications present certain pages or functionality to the user only after a successful form submission. Even if the spidering agent can parse forms and generate submissions, many applications require the submissions to contain legitimate application data; otherwise, the application's business logic prevents the user from reaching subsequent pages or areas of the application. Another thing to consider with automated spidering agents is that because they typically follow every link and submit every form a given web application presents, they might cause unanticipated events to occur. For example, if a hyperlink presented to the user allows certain transactions to be processed or initiated, the agent might inadvertently delete or modify application data or initiate unanticipated transactions on behalf of a user. For these reasons, most experienced testers prefer the manual crawling technique because it allows them to achieve a thorough crawl of the application while maintaining control over the data and pages that are requested during the crawl and ultimately are used to generate test requests.

Our scanner will rely on a manual application crawl to discover all testable pages and requests. To accomplish this, we will use one of the many available freeware local proxy server utilities to record all application requests in a log file as we manually crawl the application. To extract the relevant data from the log file, first we will need to create a log file parsing script. The parsing script is used to generate a reasonably simple input file that our scanner will use. By developing a separate script for log file parsing, our scanner will be more flexible because it will not be tied to a specific local proxy utility. Additionally, this will give us more control over the scan requests because we will be able to manually review the requests that are used to perform testing without having to sift through a messy log file. Keep in mind that the input file our scanner will use should contain only legitimate, or untainted, application requests. The specific attack strings used to test the application will be generated on-the-fly by our scanner.

Now that we know how the scanner will obtain data about the application (an input file generated from the proxy log of a manual crawl), we must decide what tests our scanner will conduct and how it will perform its testing. For web applications, we can perform a series of common tests to identify some general application and/or server vulnerabilities. First we will want to perform input validation testing against each application input parameter. At a minimum we should be able to perform tests for SQL injection and XSS, two common web application vulnerabilities. Because we will be performing these tests against each application input parameter, we refer to them as parameter-based tests.

In addition to parameter-based testing, we will want to perform certain tests against each application server directory. For example, we will want to make a direct request to each server directory to see if it permits a directory listing that exposes all files contained within it. We also will want to check to see if we can upload files to each directory using the HTTP PUT method because this typically allows an attacker to upload his own application pages and compromise both the application and the server. Going forward we refer to these tests as directory-based tests.

For reporting purposes, our scanner should be able to report the request data used for a given request if it discovers a potential vulnerability, and report some information regarding the type of vulnerability it detected. This information will allow us to analyze and validate the output to confirm identified issues. As such, our scanner should be able to generate an output file in addition to printing output to the screen. The final requirement for our scanner is the ability to use HTTP cookies. Most authenticated applications rely on some sort of authentication token or session identifier that is passed to the application in the form of a cookie. Even a simple scanner such as the one we are building needs to have cookie support to be useful.
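The cookie requirement can be illustrated with a short sketch. It is written in Python purely for illustration and is not the chapter's Perl code; the function name `build_request`, the `http://` scheme, the host `www.example.com`, and the cookie value are all assumptions made for the example. The sketch only constructs the request object and does not send anything.

```python
import urllib.request


def build_request(host, method, path, data="", cookie=None):
    """Build (but do not send) a request that carries an optional cookie."""
    url = f"http://{host}{path}"  # assumption: plain HTTP target
    body = None
    if method == "POST":
        # POST data travels in the request body.
        body = data.encode("ascii")
    elif data:
        # GET data travels in the query string.
        url += "?" + data
    req = urllib.request.Request(url, data=body, method=method)
    if cookie:
        # The same session cookie would accompany every test request.
        req.add_header("Cookie", cookie)
    return req


req = build_request(
    "www.example.com",              # hypothetical host
    "POST",
    "/public/content/jsp/user.jsp",
    data="fname=Jim&Lname=Doe",
    cookie="JSESSIONID=abc123",     # hypothetical session cookie
)
print(req.get_method(), req.full_url, req.get_header("Cookie"))
```

Keeping the cookie as an opaque string means the tester can paste in whatever session token the application issued after a manual login.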

8.2.2. Scanner Design

Now that we have defined the basic requirements for our scanner, we can start to develop an overview of the scanner's overall structure. Based on our requirements, two separate scripts will be used to perform testing. We will use the first script to parse the proxy log file and generate an input file with the request data to be used for testing. The second script will accept the input file and perform various tests against the application based on the pages and parameter data contained in the file.

8.2.2.1. parseLog.pl

Our first script is called parseLog.pl, and it is used to parse the proxy server log file. This script accepts one mandatory input argument containing the name of the file to be parsed. The script's output is in a simple format that our scanner can use as input. At this point, it probably makes sense to define the actual structure of the input file and the requests contained within it. We must keep in mind here that we most likely will see the following types of requests in our log file:

GET requests (without a query string)
GET requests (with a query string)
POST requests

To handle these request types, we generate a flat text file with one line for each request, as shown in Example 8-5. The first portion of the line contains the request method (GET or POST), followed by a space, and then by the path to the resource being requested. If the request uses the GET method with query string data, it is concatenated to the resource name using a question mark (?). This is the same syntax used to pass query string data as defined by HTTP, so it should be fairly straightforward. For POST requests, the POST data string is concatenated to the resource name using the same convention (a question mark). Because it is a POST request, the scanner knows to pass the data to the server in the body of the HTTP request rather than in the query string.

Example 8-5. Sample input file entries

GET /public/content/jsp/news.jsp?id=2&view=F
GET /public/content/jsp/news.jsp?id=8&view=S
GET /images/logo.gif
POST /public/content/jsp/user.jsp?fname=Jim&Lname=Doe
POST /public/content/jsp/user.jsp?fname=Jay&Lname=Doe
GET /images/spacer.gif
GET /content/welcome.jsp

Another nice thing about this input file format is that it lets us easily edit the entries by hand and craft custom entries. Because the script's only purpose is to generate input file entries, we don't need it to write a separate output file. Instead, we simply use the greater-than (>) character to redirect the script's output to a local file when we run it. You will also notice that the input file contains no hostname or IP address, giving us the flexibility to use the input file against other hostnames or IP addresses if our application gets moved.
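To make the input file format concrete, here is a minimal sketch of how one entry could be split back into its method, resource path, and parameters. It is illustrative Python rather than the chapter's Perl, and the names `Entry` and `parse_entry` are hypothetical. Note that `parse_qsl` also URL-decodes names and values, which a real scanner may or may not want.

```python
from typing import NamedTuple
from urllib.parse import parse_qsl


class Entry(NamedTuple):
    method: str   # "GET" or "POST"
    path: str     # resource path with the data portion removed
    params: list  # ordered (name, value) pairs; empty when there is no data


def parse_entry(line: str) -> Entry:
    """Split one input-file line into method, path, and parameters."""
    # The method and the resource are separated by a single space.
    method, _, resource = line.strip().partition(" ")
    # Query string or POST data follows the first question mark.
    path, _, data = resource.partition("?")
    # keep_blank_values preserves parameters recorded with empty values.
    return Entry(method, path, parse_qsl(data, keep_blank_values=True))


sample = [
    "GET /public/content/jsp/news.jsp?id=2&view=F",
    "GET /images/logo.gif",
    "POST /public/content/jsp/user.jsp?fname=Jim&Lname=Doe",
]
entries = [parse_entry(line) for line in sample]
for entry in entries:
    print(entry.method, entry.path, entry.params)
```

Because GET and POST entries share one syntax, a single parsing routine covers all three request types; only the method tells the scanner where the data belongs when the request is sent.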

As for the proxy server that our parsing script supports, we are using the Burp freeware proxy server (http://www.portswigger.net). We chose Burp because of its multiplatform support (it's written in Java) and because, like many local proxy tools, it logs the raw HTTP request and response data. Regardless of which proxy tool you use, as long as the log file contains the raw HTTP requests the parsing logic should be virtually identical. We will more closely examine the Burp proxy and its log format a bit later in the chapter.
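As a rough illustration of why the parsing logic carries over between proxy tools, the sketch below converts a single raw HTTP request into one input file entry. It is Python for illustration only, not the chapter's parseLog.pl, and it deliberately says nothing about Burp's log layout: it assumes each request has already been isolated from the log, that the request line uses a path rather than an absolute URL, and that POST targets carry no query string of their own. The function name `request_to_entry` and the host `www.example.com` are hypothetical.

```python
def request_to_entry(raw: str) -> str:
    """Convert one raw HTTP request into an input-file line."""
    # Headers and body are separated by the first blank line.
    head, _, body = raw.replace("\r\n", "\n").partition("\n\n")
    # The request line looks like: METHOD target HTTP-version
    request_line = head.split("\n", 1)[0]
    method, target, _version = request_line.split(" ", 2)
    body = body.strip()
    if method == "POST" and body:
        # POST data is joined to the resource with "?", as in the input format.
        # Assumption: the POST target has no query string of its own.
        target = target + "?" + body
    return f"{method} {target}"


raw_get = (
    "GET /public/content/jsp/news.jsp?id=2&view=F HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "\r\n"
)
raw_post = (
    "POST /public/content/jsp/user.jsp HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    "Content-Length: 19\r\n"
    "\r\n"
    "fname=Jim&Lname=Doe"
)
get_entry = request_to_entry(raw_get)
post_entry = request_to_entry(raw_post)
print(get_entry)
print(post_entry)
```

The Host header is ignored on purpose, since the input file format omits the hostname.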

8.2.2.2. simpleScanner.pl

Now that we have a basic design for our log file parsing script, we can start designing the actual scanner, which is called simpleScanner.pl. We have already stated that the script needs to accept an input file, and because the input file format we just defined contains no hostname, the script also needs a second mandatory input argument consisting of the hostname or IP address to be tested. In addition to these two mandatory input arguments, we also need some optional arguments for our scanner. When we defined the scanner requirements, we mentioned that the tool would need to be able to generate an output file and support HTTP cookies. These two features are better left as optional arguments because they might not be required under certain circumstances. We also will add an option for verbosity so that our scanner has two levels of reporting.
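The argument layout described here (two mandatory arguments plus optional output file, cookie, and verbosity) might look like the following sketch. It is illustrative Python rather than the chapter's Perl, and the option letters `-o`, `-c`, and `-v`, the program name, and the sample values are assumptions chosen for the example, not the script's actual switches.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Describe the scanner's command line (option names are hypothetical)."""
    parser = argparse.ArgumentParser(prog="simpleScanner")
    # Two mandatory positional arguments.
    parser.add_argument("input_file", help="file of requests from the log parser")
    parser.add_argument("host", help="hostname or IP address to test")
    # Optional features that are not needed on every run.
    parser.add_argument("-o", "--output", metavar="FILE",
                        help="also write findings to FILE")
    parser.add_argument("-c", "--cookie", metavar="STRING",
                        help="cookie string to send with every request")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="switch to the more detailed reporting level")
    return parser


# Parse a sample command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["input.txt", "www.example.com", "-c", "JSESSIONID=abc123", "-v"]
)
print(args.input_file, args.host, args.output, args.cookie, args.verbose)
```

Leaving the output file and cookie unset by default keeps the simplest invocation down to the two mandatory arguments.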

At the code level, we will develop a main script routine that controls the overall execution flow, and we will call various subroutines for each major task the scanner needs to perform. This allows us to segregate the code into manageable blocks based on overall function, and it allows us to reuse these routines at various points within the execution cycle. The first task our scanner needs to do is to read the entries from the input file. Once the file has been parsed, each individual request is parsed to perform our parameter- and directory-based testing.

A common mistake when testing application parameters for input validation is to fuzz, or alter, several parameters simultaneously. Although this approach allows us to test multiple parameters at once, contaminated data from one parameter might prevent another from being interpreted by the code. For our parameter-based testing, only one parameter will be tested at a time while the remaining parameters contain their original values obtained from the log file entry. In other words, there will be one test request for each variable on each application page. To minimize the number of unnecessary or redundant test requests, we also track each page and the associated parameter(s) that are tested. Only unique page/parameter combinations will be tested to avoid making redundant test requests.
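The one-parameter-at-a-time rule and the page/parameter tracking can be sketched as follows. This is illustrative Python, not the chapter's Perl; the name `parameter_tests` is hypothetical, and the two entries in `TEST_STRINGS` are placeholders rather than the scanner's real attack strings. The sketch URL-encodes the substituted value, which is a simplification a real scanner might handle differently.

```python
from urllib.parse import parse_qsl, urlencode

# Placeholder test strings for illustration only.
TEST_STRINGS = ["'", "<script>alert(1)</script>"]


def parameter_tests(entries):
    """Yield (method, path, data) test requests, one altered parameter each."""
    seen = set()  # page/parameter combinations already tested
    for line in entries:
        method, _, resource = line.partition(" ")
        path, _, data = resource.partition("?")
        params = parse_qsl(data, keep_blank_values=True)
        for index, (name, _value) in enumerate(params):
            if (path, name) in seen:
                continue  # skip redundant page/parameter combinations
            seen.add((path, name))
            for test_string in TEST_STRINGS:
                # Copy the original values and replace only this parameter.
                mutated = list(params)
                mutated[index] = (name, test_string)
                yield method, path, urlencode(mutated)


entries = [
    "GET /public/content/jsp/news.jsp?id=2&view=F",
    "GET /public/content/jsp/news.jsp?id=8&view=S",  # same page and parameters
    "GET /images/logo.gif",                          # nothing to test
]
tests = list(parameter_tests(entries))
for test in tests:
    print(test)
```

With these three entries, only the first contributes tests: the second repeats the same page/parameter combinations and the third has no parameters.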

Once every parameter of a given request has been tested, all parameter values are stripped from the request and the URL path is truncated at each directory level to perform directory-based testing. Again, one request is made for each directory level of the URL path, and we keep track of these requests to avoid making duplicate or redundant requests. Figure 8-1 visually represents the logic of our tool.
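The directory truncation step can be sketched in a few lines. This is illustrative Python rather than the chapter's Perl; the function name `directory_levels` is hypothetical, and including the web root (`/`) as the last level is an assumption about how far the truncation should go.

```python
def directory_levels(resource: str) -> list:
    """Return each directory prefix of a URL path, deepest first."""
    path = resource.split("?", 1)[0]            # strip all parameter data
    directory = path[: path.rfind("/") + 1]     # drop the file name
    levels = []
    while directory:
        levels.append(directory)
        # Remove the trailing slash, then cut back to the previous slash.
        directory = directory[: directory.rstrip("/").rfind("/") + 1]
    return levels


seen = set()   # directories already queued for testing
unique = []
for entry in [
    "GET /public/content/jsp/news.jsp?id=2&view=F",
    "GET /images/logo.gif",
]:
    resource = entry.split(" ", 1)[1]
    for directory in directory_levels(resource):
        if directory not in seen:
            seen.add(directory)
            unique.append(directory)
print(unique)
```

Each directory in `unique` would then receive the directory-based tests once, no matter how many logged requests share it.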

Figure 8-1. Visual representation of scanner logic

Now we are almost ready to begin coding our scanner, but first we should quickly review the process of generating test data using a local proxy server.

8.2.3. Generating Test Data

You can use any local proxy server to record a manual crawl of an application, provided it supports logging of all HTTP requests. Most proxy servers of this type also natively support SSL and can log the plain-text requests the browser makes when using HTTPS. Once the manual crawl is complete we should have a log file containing all the raw HTTP requests made to the application.

Our parseLog.pl script is designed to work with the Burp proxy tool. Burp is written in Java and you can freely download it from the PortSwigger web site mentioned earlier. You will also need a Java Runtime Environment (JRE), preferably Sun's, installed on the machine on which you want to run Burp. You can download the most recent version of the Sun JRE from http://www.java.com/en/download/.

Once you download and run Burp, you need to make sure logging to a file has been enabled and you are not intercepting requests or responses. By logging without intercepting, the proxy server seamlessly passes all HTTP requests back and forth without requiring any user interaction, and it logs all requests to the log file. Figure 8-2 shows the Burp options necessary to generate the activity log.

Figure 8-2. Burp options screen

You also need to set your web browser to use Burp as a proxy server (by default the hostname is localhost and the port number is 5000). Because the goal of this phase is to inventory all application pages and parameters, no testing or parameter manipulation should be done during the crawl. The log file should ideally contain only legitimate application requests and legitimate parameter values. We want to ensure that when crawling the web application all application links are followed and all application forms are submitted successfully. Once you have successfully crawled the entire application, you should make a copy of the log file to use for testing.

The log file generated during the application crawl contains a plain-text record of all data, including potentially sensitive information, passed to the application. This will likely include the username and password used to authenticate to the application.