getContent() invokes a content handler whenever it's called to retrieve an object at some URL. The content handler must read the flat stream of data produced by the URL's protocol handler (the data read from the remote source), and construct a well-defined Java object from it. By "flat," I mean that the data stream the content handler receives has no artifacts left over from retrieving the data and processing the protocol. It's the protocol handler's job to fetch and decode the data before passing it along. The protocol handler's output is your data, pure and simple.
The roles of content and protocol handlers do not overlap. The content handler doesn't care how the data arrives, or what form it takes. It's concerned only with what kind of object it's supposed to create. For example, if a particular protocol involves sending an object over the network in a compressed format, the protocol handler should do whatever is necessary to unpack it before passing the data on to the content handler. The same content handler can then be used again with a completely different protocol handler to construct the same type of object received via a different transport mechanism.
Let's look at an example. The following lines construct a URL that points to a GIF file on an FTP archive and attempt to retrieve its contents:
try { URL url = new URL ("ftp://ftp.wustl.edu/graphics/gif/a/apple.gif"); Image img = (Image)url.getContent(); ...
When we construct the URL object, Java looks at the first part of the URL string (i.e., everything prior to the colon) to determine the protocol and locate a protocol handler. In this case, it locates the FTP protocol handler, which is used to open a connection to the host and transfer data for the specified file.
After making the connection, the URL object asks the protocol handler to identify the resource's MIME type.[5] It does this through a variety of means, but in this case it probably just looks at the filename extension (.gif ) and determines that the MIME type of the data is image/gif. The protocol handler then looks for the content handler responsible for the image/gif type and uses it to construct the right kind of object from the data. The content handler returns an Image object, which getContent() returns to us as an Object; we cast this Object back to the Image type so we can work with it.
[5] MIME stands for Multipurpose Internet Mail Extensions. It's a standard design to facilitate multimedia email, but it has become more widely used as a way to specify the treatment of data contained in a document.
In an upcoming section, we'll build a simple content handler. To keep things as simple as possible, our example will produce text as output; the URL's getContent() method will return this as a String object.
As I said earlier, there's no standard yet for where content handlers should be located. However, we're writing code now and need to know what package to place our class files in. In turn, this determines where to place the class files in the local filesystem. Because we are going to write our own standalone application to use our handler, we'll place our classes in a package in our local class path and tell Java where they reside. However, we will follow the naming scheme that's likely to become the standard. If other applications expect to find handlers in different locations (either locally or on servers), you'll simply have to repackage your class files according to their naming scheme and put them in the correct place.
Package names translate to path names when Java is searching for a class. This holds for locating content-handler classes as well as other kinds of classes. For example, on a UNIX- or DOS-based system, a class in a package named net.www.content would live in a directory with net/www/content/ as part of its pathname. To allow Java to find handler classes for arbitrary new MIME types, content handlers are organized into packages corresponding to the basic MIME type categories. The handler classes themselves are then named after the specific MIME type. This allows Java to map MIME types directly to class names.
According to the scheme we'll follow, a handler for the image/gif MIME type is called gif and placed in a package called net.www.content.image. The fully qualified name of the class would then be net.www.content.image.gif, and it would be located in the file net/www/content/image/gif.class, somewhere in the local class path or on a server. Likewise, a content handler for the video/mpeg MIME type would be called mpeg, and there would be an mpeg.class file located (again, on a UNIX-/DOS-like filesystem) in a net/www/content/video/ directory somewhere in a local class path or on a server.
Many MIME type names include a dash (-), which is illegal in a class name. You should convert dashes and other illegal characters into underscores when building Java class and package names. Also note that there are no capital letters in the class names. This violates the coding convention used in most Java source files, in which class names start with capital letters. However, capitalization is not significant in MIME type names, so it's simpler to name the handler classes accordingly. Table 9.1 shows how some typical MIME types are converted to package and class names.[6]
[6] The "pre-beta 1" release of HotJava has a temporary solution that is compatible with the convention described here. In the HotJava properties file, add the line: java.content.handler.pkgs=net.www.content.
MIME type | Package name | Class name | Class location |
---|---|---|---|
image/gif | net.www.content.image | gif | net/www/content/image/ |
image/jpeg | net.www.content.image | jpeg | net/www/content/image/ |
text/html | net.www.content.text | html | net/www/content/text/ |
In this section, we'll build a simple content handler that reads and interprets tar (tape archive) files. tar is an archival format widely used in the UNIX-world to hold collections of files, along with their basic type and attribute information.[7] A tar file is similar to a ZIP file, except that it's not compressed. Files in the archive are stored sequentially, in flat text or binary with no special encoding. In practice, tar files are usually compressed for storage using an application like UNIX compress or GNU gzip and then named with a filename extension like .tar.gz or .tgz.
[7] There are several slightly different versions of the tar format. This content handler understands the most widely used variant.
Most Web browsers, upon retrieving a tar file, prompt the user with a File Save dialog. The assumption is that if you are retrieving an archive, you probably want to save it for later unpacking and use. We would like to instead implement a tar content handler that allows an application to read the contents of the archive and give us a listing of the files that it contains. In itself, this would not be the most useful thing in the world, because we would be left with the dilemma of how to get at the archive's contents. However, a more complete implementation of our content handler, used in conjunction with an application like a Web browser, could generate output that lets us select and save individual files within the archive.
The code that fetches the .tar file and lists its contents looks like this:
try { URL listing = new URL("http://somewhere.an.edu/lynx/lynx2html.tar"); String s = (String)listing.getContents(); System.out.println( s ); ...
We'll produce a listing similar to the UNIX tar application's output:
Tape Archive Listing: 0 Tue Sep 28 18:12:47 CDT 1993 lynx2html/ 14773 Tue Sep 28 18:01:55 CDT 1993 lynx2html/lynx2html.c 470 Tue Sep 28 18:13:24 CDT 1993 lynx2html/Makefile 172 Thu Apr 01 15:05:43 CST 1993 lynx2html/lynxgate 3656 Wed Mar 03 15:40:20 CST 1993 lynx2html/install.csh 490 Thu Apr 01 14:55:04 CST 1993 lynx2html/new_globals.c ...
Our content handler dissects the file to read the contents and generates the listing. The URL's getContent() method returns that information to our application as a String object.
First we must decide what to call our content handler and where to put it. The MIME-type hierarchy classifies the tar format as an "application type extension." Its proper MIME type is then application/x-tar. Therefore, our handler belongs to the net.www.content.application package, and goes into the class file net/www/content/application/x_tar.class. Note that the name of our class is x_tar, rather than x-tar; you'll remember the dash is illegal in a class name so, by convention, we convert it to an underscore.
Here's the code for the content handler; compile it and place it in the net/www/content/application/ package, somewhere in your class path:
package net.www.content.application; import java.net.*; import java.io.*; import java.util.Date; public class x_tar extends ContentHandler { static int RECORDLEN = 512, NAMEOFF = 0, NAMELEN = 100, SIZEOFF = 124, SIZELEN = 12, MTIMEOFF = 136, MTIMELEN = 12; public Object getContent(URLConnection uc) throws IOException { InputStream is = uc.getInputStream(); StringBuffer output = new StringBuffer( "Tape Archive Listing:\n\n" ); byte [] header = new byte[RECORDLEN]; int count = 0; while ( (is.read(header) == RECORDLEN) && (header[NAMEOFF] != 0) ) { String name = new String(header, 0, NAMEOFF, NAMELEN).trim(); String s = new String(header, 0, SIZEOFF, SIZELEN).trim(); int size = Integer.parseInt(s, 8); s = new String(header, 0, MTIMEOFF, MTIMELEN).trim(); long l = Integer.parseInt(s, 8); Date mtime = new Date( l*1000 ); output.append( size + " " + mtime + " " + name + "\n" ); count += is.skip( size ) + RECORDLEN; if ( count % RECORDLEN != 0 ) count += is.skip ( RECORDLEN - count % RECORDLEN); } if ( count == 0 ) output.append("Not a valid TAR file\n"); return( output.toString() ); } }
Our x_tar handler is a subclass of the abstract class java.net.ContentHandler. Its job is to implement one method: getContent(), which takes as an argument a special "protocol connection" object and returns a constructed Java Object. The getContent() method of the URL class ultimately uses this getContent() method when we ask for the contents of the URL.
The code looks formidable, but most of it's involved with processing the details of the tar format. If we remove these details, there isn't much left:
public class x_tar extends ContentHandler { public Object getContent( URLConnection uc ) throws IOException { // get input stream InputStream is = uc.getInputStream(); // read stream and construct object // ... // return the constructed object return( output.toString() ); } }
That's really all there is to a content handler; it's relatively simple.
The java.net.URLConnection object that getContent() receives represents the protocol handler's connection to the remote resource. It provides a number of methods for examining information about the URL resource, such as header and type fields, and for determining the kinds of operations the protocol supports. However, its most important method is getInputStream(), which returns an InputStream from the protocol handler. Reading this InputStream gives you the raw data for the object the URL addresses. In our case, reading the InputStream feeds x_tar the bytes of the tar file it's to process.
The majority of our getContent() method is devoted to interpreting the stream of bytes of the tar file and building our output object: the String that lists the contents of the tar file. Again, this means that this example involves the particulars of reading tar files, so you shouldn't fret too much about the details.
After requesting an InputStream from the URLConnection, x_tar loops, gathering information about each file. Each archived item is preceded by a header that contains attribute and length fields. x_tar interprets each header and then skips over the remaining portion of the item. It accumulates the results (the file listings) in a StringBuffer. (See Chapter 7, Basic Utility Classes for a discussion of StringBuffer.) For each file, we add a line of text listing the name, modification time, and size. When the listing is complete, getContent() returns the StringBuffer as a String object.
The main while loop continues as long as it's able to read another header record, and as long as the record's "name" field isn't full of ASCII null values. (The tar file format calls for the end of the archive to be padded with an empty header record, although most tar implementations don't seem to do this.) The while loop retrieves the name, size, and modification times as character strings from fields in the header. The most common tar format stores its numeric values in octal, as fixed-length ASCII strings. We extract the strings and use Integer.parseInt() to parse them.
After reading and parsing the header, x_tar skips over the data portion of the file and updates the variable count, which keeps track of the offset into the archive. The two lines following the initial skip account for tar's "blocking" of the data records. In other words, if the data portion of a file doesn't fit precisely into an integral number of blocks of RECORDLEN bytes, tar adds padding to make it fit.
Whew. Well, as I said, the details of parsing tar files are not really our main concern here. But x_tar does illustrate a few tricks of data manipulation in Java.
It may surprise you that we didn't have to provide a constructor; our content handler relies on its default constructor. We don't need to provide a constructor because there isn't anything for it to do. Java doesn't pass the class any argument information when it creates an instance of it. You might suspect that the URLConnection object would be a natural thing to provide at that point. However, when you are calling the constructor of a class that is loaded at run-time, you can't pass it any arguments, as we discussed in Chapter 5, Objects in Java.
When I began this discussion of content handlers, I showed a brief example of how our x_tar content handler would work for us. We need to make a few brief additions to that code in order to use our new handler and fetch URLs that point to .tar files. Since we're writing a standalone application, we're not only responsible for writing handlers that obey the package/class naming scheme we described earlier; we are also responsible for making our application use the naming scheme.
In a standalone application, the mapping between MIME types and content-handler class names is done by a special java.net.ContentHandlerFactory object we must install. The ContentHandlerFactory accepts a String containing a MIME type and returns the appropriate content handler. It's responsible for implementing the naming convention and creating an instance of our handler. Note that you don't need a content-handler factory if you are writing handlers for use by remote applications; a browser like HotJava, that loads content handlers over the Net, has its own content-handler factory.
To make absolutely clear what's happening, we'll provide a simple factory that knows only about our x_tar handler and install it at the beginning of our application:
import java.net.*; import java.io.*; class OurContentHandlerFactory implements ContentHandlerFactory { public ContentHandler createContentHandler(String mimetype) { if ( mimetype.equalsIgnoreCase( "application/x-tar" ) ) return new net.www.content.application.x_tar(); else return null; } } public class TarURLTest { public static void main (String [] args) throws Exception { URLConnection.setContentHandlerFactory(new OurContentHandlerFactory() ); URL url = new URL( args[0] ); String s = (String)url.getContent(); System.out.println( s ); } }
The class OurContentHandlerFactory implements the ContentHandlerFactory interface. It recognizes the MIME-type application/x-tar and returns a new instance of our x_tar handler. TarURLTest uses the static method URLConnection.setContentHandlerFactory() to install our new ContentHandlerFactory. After it's installed, our factory is called every time we retrieve the contents of a URL object. If it returns a null value, Java looks for handlers in a default location.[8]
[8] If we don't install a ContentHandlerFactory (or later, as we'll see a URLStreamHandlerFactory for protocol handlers), Java defaults to searching for a vendor-specific package name. If you have Sun's Java Development Kit, it searches for content handlers in the sun.net.www.content package hierarchy and protocol handler classes in the sun.net.www.protocol package hierarchy.
After installing the factory, TarURLTest reads a URL from the command line, opens that URL, and lists its contents. Now you have a portable tar command that can read its tar files from arbitrary locations on the Net. I'll confess that I was lazy about exception handling in this example. Of course, a real application would need to catch and handle the appropriate exceptions; but we already know how to do that.
A final design note. Our content handler returned the tar listing as a String. I don't want to harp on the point, but this isn't the only option. If we were writing a content handler to work in the context of a Web browser, we might want it to produce some kind of HTML object that might display the listing as hypertext. Again, knowing the right solution requires that we know what kind of object a browser expects to receive, and currently that's undefined.
In the next section, we'll turn the tables and look at protocol handlers. There we'll be building URLConnection objects and someone else will have the pleasure of reconstituting the data.
This HTML Help has been published using the chm2web software. |