Java Web Applications, Part 2 (second draft)

by Steven J. Owens (unless otherwise attributed)

Inside HTTP

Most people seem to take HTTP for granted, but I've found it's a really, really good idea to get a packet sniffer or a logging proxy and actually watch what the browser sends up to your server, and what your server sends back, and learn about HTTP.

This is especially true when you're doing something clever with the browser. It's even more true when you're trying to figure out why your cleverness isn't working! But I've found in general that being able to watch the actual protocol has given me a lot more of a feel for what's going on. I don't like "magic".

Mozilla Firebird has this nifty plug-in, called Live Http Headers, which displays the header content of the HTTP traffic as you use the browser. It's simpler and easier to use than tcpdump, but I still recommend using tcpdump or tcpflow to really watch the entire connection. Unfortunately this gets a little ugly when your browser is fetching binary objects, like gifs or jpegs.

Note: I've recently come across two HTTP proxy logging tools that appear to be the sort of thing I like to use: Nettools (http://neilja.net/nettool/index.html) and ParosProxy (http://www.parosproxy.org/index.shtml). Both are implemented in Java, so they're truly multiplatform and will run wherever you want to use them.

Most programmers have a basic idea of what HTTP is. It's stateless; the browser opens a new connection for each request, closes the connection when it's received the response, and renders the results for the user. All action is initiated from the browser to the server. There is no persistent connection with the server, there is no way for the server to get a handle on the browser, there is no way for the server to initiate any action to the browser.

This gets complicated a bit by cookies and HTTP 1.1 persistent connections, but the fundamental nature is still there.

When your browser makes an HTTP request, it opens a TCP/IP connection to port 80 on the webserver (or port 443 if it's an SSL server, with a URL that begins with "httpS:" instead of "http:"). A TCP/IP connection is like a stripped-down telnet connection (telnet adds a very small set of escape characters on top of a regular TCP/IP connection). You can fake an HTTP request easily by telnetting to port 80 on a webserver and typing the appropriate HTTP commands in. In the example below, the lines I actually typed are the ones that end in <enter>, which stands for the enter key:

 00:39:07, puff@darksleep:/var/www/htdocs/notablog/article> telnet www.darksleep.com 80<enter>
 Trying 66.45.34.102...
 Connected to darksleep.com.
 Escape character is '^]'.
 GET / HTTP/1.0<enter>
 <enter>
 HTTP/1.1 200 OK
 Date: Thu, 26 Feb 2004 05:39:21 GMT
 Server: Apache/1.3.27 (Unix) Debian GNU/Linux PHP/4.2.3
 Last-Modified: Fri, 30 Jan 2004 23:19:18 GMT
 ETag: "c8695-796-401ae676"
 Accept-Ranges: bytes
 Content-Length: 1942
 Connection: close
 Content-Type: text/html; charset=iso-8859-1

 <html>
 <head> <title>DarkSleep</title> </head>
 <body BGCOLOR="white">
 <hr>
 <h2>Welcome to darksleep.com <a HREF="beehive"><img ALIGN="center" ALT="Beehive" BORDER=0 SRC="beehive.jpg" width="75" height="50"></a></h2>
 <p>Someday there might even be a website here.</p>
 <p><center><img ALT="Give me coffee and nobody gets hurt" SRC="constructioncoffee.gif"></center></p>
 <p>Meanwhile, here's some <a HREF="/puff/">stuff</a>, mostly random articles and essays I've written over the years.</p>
 <hr>
 Here's a nice <a HREF="everyossucks.mp3">song</a>.
 </body>
 </html>
 Connection closed by foreign host.
 00:39:15, puff@darksleep:/var/www/htdocs/notablog/article>

A couple of things to note here:

First, note that I typed <enter> twice after my first command. A blank line indicates the break between the head and body of a request or response. Most requests (except POSTs) have no body, so that blank line is the end of the request.

Second, the HTTP request in this example is very simple: I just typed the HTTP GET command itself, with the URL "/" and the mandatory HTTP version argument. I didn't fake any headers along with the request.

Normally the browser will send along a half-dozen or more headers with each request. Cookies, for example, are sent along with each request to the location that set the cookie, whether the request has anything to do with the cookie or not.

If any of this sounds familiar, it should, because it's pretty much the format used for internet email. Specifically, it's a MIME-style message, MIME being the more modern standard for internet email format. An excellent, detailed explanation of MIME encoding is in chapters 3 and 4 of the O'Reilly book Programming Internet Email. The short form is you have a set of header lines, a blank line, and then the body. Each header consists of a name, a colon (:) and a value.
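To make the "header lines, blank line, body" layout concrete, here is a minimal Java sketch that splits a raw HTTP message into its start line, headers, and body. The class name HttpMessage and its parse() method are made up for this illustration. It assumes lines end in CRLF, and it ignores real-world details such as repeated headers and case-insensitive header names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: HttpMessage is a hypothetical helper, not a real API.
class HttpMessage {
    final String startLine;                       // e.g. "HTTP/1.1 200 OK"
    final Map<String, String> headers = new LinkedHashMap<>();
    final String body;

    private HttpMessage(String startLine, String body) {
        this.startLine = startLine;
        this.body = body;
    }

    static HttpMessage parse(String raw) {
        // The first blank line (CRLF CRLF) separates the head from the body.
        int split = raw.indexOf("\r\n\r\n");
        String head = split >= 0 ? raw.substring(0, split) : raw;
        String body = split >= 0 ? raw.substring(split + 4) : "";

        String[] lines = head.split("\r\n");
        HttpMessage msg = new HttpMessage(lines[0], body);
        for (int i = 1; i < lines.length; i++) {
            // Split at the FIRST colon only; values such as dates contain colons.
            int colon = lines[i].indexOf(':');
            if (colon < 0) {
                continue; // not a "name: value" line, skip it
            }
            String name = lines[i].substring(0, colon).trim();
            String value = lines[i].substring(colon + 1).trim();
            msg.headers.put(name, value);
        }
        return msg;
    }

    public static void main(String[] args) {
        String raw = "HTTP/1.1 200 OK\r\n"
                + "Content-Type: text/html; charset=iso-8859-1\r\n"
                + "Content-Length: 13\r\n"
                + "\r\n"
                + "<html></html>";
        HttpMessage msg = parse(raw);
        System.out.println(msg.startLine);
        System.out.println(msg.headers.get("Content-Type"));
        System.out.println(msg.body);
    }
}
```

A request parses the same way; its body is simply empty unless it is a POST or PUT.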

Like I said above, most HTTP requests don't have bodies. POST has a body. The seldom-implemented PUT has a body. I'm not sure what else does. A good place to find out would be at w3.org.

HTTP responses use the same MIME-style format, and they almost always have a body, which contains the actual data asked for: the HTML tags for the page, the binary data for an image, or whatever.

Speaking of binary data, one thing to note: unlike standard RFC 1521 MIMEs, HTTP MIMEs don't bother to base64-encode the binary data. I guess this makes sense; since the binary data is going straight from the server to the browser (or in rare situations, vice-versa) they don't need to worry about some mail server mangling it.

There is, as near as I can tell, nothing in the message text itself that marks the end of an HTTP request or response. Instead, the receiver either counts bytes using the Content-Length header (the response in the example above announces Content-Length: 1942), or, when there is no such header, reads until the other side stops sending data and closes the connection. With HTTP 1.1 persistent connections, you can set headers to keep the connection open and wait for more data to come down. HTTP 1.1 also supports "chunked encoding", where the body is split up into chunks and each chunk is preceded by its size in hexadecimal; a chunk of size zero marks the end. Modern clients support this, but I don't think many web applications use it.
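Here is a rough Java sketch of how a chunked body can be put back together. ChunkedBody and decode() are names invented for this example. It assumes the whole body is already in memory as a string of single-byte characters (chunk sizes count bytes, not characters), and it ignores trailers and chunk extensions.

```java
// Illustrative sketch only: ChunkedBody is a hypothetical helper, not a real API.
class ChunkedBody {
    static String decode(String chunked) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (true) {
            // Each chunk starts with a line holding its size in hexadecimal.
            int lineEnd = chunked.indexOf("\r\n", pos);
            if (lineEnd < 0) {
                throw new IllegalArgumentException("missing chunk-size line");
            }
            String sizeLine = chunked.substring(pos, lineEnd);
            int semi = sizeLine.indexOf(';');
            if (semi >= 0) {
                sizeLine = sizeLine.substring(0, semi); // drop any chunk extension
            }
            int size = Integer.parseInt(sizeLine.trim(), 16);
            if (size == 0) {
                break; // a zero-size chunk marks the end of the body
            }
            int dataStart = lineEnd + 2;
            out.append(chunked, dataStart, dataStart + size);
            pos = dataStart + size + 2; // skip the CRLF that follows the chunk data
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n")); // prints Wikipedia
    }
}
```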

Browser-based uploads (officially multipart/form-data, RFC 1867, but everybody seems to just call it upload) are done by the browser sending a POST with an additional header (Content-Type: multipart/form-data). Instead of the normal query string encoding, the posted data is sent in something much more like a MIME with a binary attachment. Once again, the binary data itself isn't base64-encoded.

Most modern browsers open several simultaneous connections to the server when it's convenient, typically when they grab a page and the page has image tags for several images. When I first heard about HTTP 1.1 and persistent HTTP connections, I remember hearing some talk that browsers would keep the persistent HTTP connection open so they could grab the image data referred to by a page. As far as I can tell browsers do both: they open a handful of parallel connections and reuse each one for several requests when the server allows it.

Parameters and Query Strings

Oddly enough, while HTTP operations (http://www.w3.org/Protocols/HTTP/) and URL encoding (http://www.w3.org/Addressing/URL/uri-spec.html) are well-specified, I can't seem to find any official document that specifies how CGI query strings with parameters should be constructed. Then again, I can't claim to have looked exhaustively.

Parameters are sent by the following process.

First, the parameter values are URL-encoded. URL-encoding and decoding can be done with java.net.URLEncoder and java.net.URLDecoder. For example, a tab becomes the characters %09, linefeed becomes %0A, and so forth. Spaces may be translated to plus (+) characters for some reason (instead of using a %nn code).
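A quick Java sketch of the encoding step, using the java.net.URLEncoder and java.net.URLDecoder classes mentioned above. The wrapper class UrlEncodingDemo and the choice of UTF-8 as the character set are assumptions made for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Illustrative wrapper; only URLEncoder and URLDecoder are real library classes.
class UrlEncodingDemo {
    static String encode(String raw) {
        try {
            return URLEncoder.encode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available, so this should never happen.
            throw new IllegalStateException(e);
        }
    }

    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String encoded = encode("a b\tc\n");
        System.out.println(encoded);                            // prints a+b%09c%0A
        System.out.println(decode(encoded).equals("a b\tc\n")); // prints true
    }
}
```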

Then the parameter name string and value string are concatenated with an equals sign (=) between them:

name=value

Then multiple parameters are concatenated with an ampersand (&) between them:

name=value&name=value&name=value&name=value

I've also read that optionally you can use a semi-colon (;) to separate parameters, but I've never seen that done in practice, and a little quick testing shows that the servlet API, or at least Jakarta Tomcat (which is the reference implementation servlet engine) doesn't recognize semi-colon as a separator.

With a GET, the parameters are just glued onto the end of the URL, with a question mark (?) to separate the parameter string from the URL:

http://www.darksleep.com/notablog/format.cgi?article=Lexicon.txt
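The steps above (encode, join name and value with "=", join pairs with "&", glue onto the URL with "?") can be sketched in Java like this. QueryString, build(), and appendTo() are hypothetical names for this example. Using a Map means a name cannot appear twice, which real query strings do allow.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: QueryString is a hypothetical helper, not a real API.
class QueryString {
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8"); // UTF-8 is an assumption
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds "name=value&name=value" with both names and values URL-encoded.
    static String build(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return sb.toString();
    }

    // For a GET, the parameter string goes after a question mark.
    static String appendTo(String url, Map<String, String> params) {
        return params.isEmpty() ? url : url + "?" + build(params);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("article", "Lexicon.txt");
        System.out.println(appendTo("http://www.darksleep.com/notablog/format.cgi", params));
        // prints http://www.darksleep.com/notablog/format.cgi?article=Lexicon.txt
    }
}
```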

With a POST, on the other hand, the parameters are stuck in the body. I'm still working on finding a good example of a POST, but I left my packet sniffer in my other pants.

One thing I'm not entirely sure of is how POST is handled when you have a whole lot of parameter data. Does it never put in a newline? Does it just put a newline at one of the ampersands? Before or after the ampersand? I'll have to look into this.
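As far as I know, a form POST sends the encoded parameter string as one unbroken run of characters with no newlines added, and a Content-Length header tells the server how many bytes to read. The Java sketch below builds the text of such a request. PostRequestSketch and build() are made-up names, and the request is a hand-assembled illustration rather than a capture from a real browser.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: builds the raw text of a form POST.
class PostRequestSketch {
    // encodedParams is assumed to be URL-encoded already, e.g. "article=Lexicon.txt".
    static String build(String host, String path, String encodedParams) {
        // Content-Length counts bytes; URL-encoded text is plain ASCII.
        int length = encodedParams.getBytes(StandardCharsets.US_ASCII).length;
        return "POST " + path + " HTTP/1.0\r\n"
                + "Host: " + host + "\r\n"
                + "Content-Type: application/x-www-form-urlencoded\r\n"
                + "Content-Length: " + length + "\r\n"
                + "\r\n"            // blank line: end of headers, start of body
                + encodedParams;    // the body, with no trailing newline
    }

    public static void main(String[] args) {
        System.out.println(build("www.darksleep.com", "/notablog/format.cgi",
                "article=Lexicon.txt"));
    }
}
```

Compare this with the GET form: the same parameter string moves from the end of the URL to the body, after the blank line.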

The classic advice about GET and POST with HTML forms is to use GET for scripts or server-side actions that are idempotent, which means repeated invocations should have the same effect as a single invocation. (For example, pushing the "on" switch, versus pushing the switch once to turn it on, again to turn it off, etc).

GET parameter strings can end up in bookmarks. This tends to make security-conscious folks a bit paranoid, since one could bookmark a username and password parameter string, for example. This is a good thing to avoid with secure data. But then people go a little overboard and decide that GET is the anti-christ. Bear in mind that from a network-level point of view, GET and POST are nearly the same thing; the only difference is a few newline characters.

GET was also widely rumored, in the early days of the web, to be fraught with peril if you had a large amount of parameter data, supposedly because of a widespread bug in some of the C text libraries. POST, on the other hand, is supposed to be able to handle large quantities of information (megabytes, in the case of multipart/form-data POSTs).

Bouncing Data Off The Browser

Browsers are mostly stateless, but there's a difference between mostly stateless and completely stateless. Everybody, sooner or later, finds it useful to push a little state to the browser and then get it back, usually somewhere else in the application, or on rare occasions in completely different applications.

In a nutshell, this is done by causing the browser to make a request to the other site. There are only a few things that cause the browser to make a request without requiring a user to click on something, and there are only a few ways to sneak data into that second request. Put these together, and the list you get is:

Most of these are pretty obvious to figure out.

Cookies

Everybody knows what a cookie is, but just for the sake of completeness, I'll define it: it's a little chunk of data that the web server asks the browser to hang onto for some length of time, and then some server side script uses it at a later point in time.

Go read up in your reference materials on cookies for attributes and such, but I'm going to briefly explain them, mainly because there are a couple nuances people usually miss.

Cookies are defined or redefined by the server by simply including a Set-Cookie header with any set of response headers. You can even do this in a redirect response.

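For example, a response that sets a cookie might look like this (a made-up illustration, not captured from a real server):

 HTTP/1.1 200 OK
 Content-Type: text/html
 Set-Cookie: visitor=12345; path=/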

Once you define a cookie, the browser just includes a Cookie header carrying that cookie in any request it sends to the same domain.

A later request to that site then carries the cookie back (a made-up illustration):

 GET /puff/ HTTP/1.0
 Cookie: visitor=12345

You can get more specific about it by using an attribute to tell the browser to only include the cookie header with requests to specific paths on the web server.
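A small Java sketch of both directions: formatting the header line a server sends to set a cookie with a path attribute, and picking apart the Cookie header a browser sends back. CookieHeaders and its methods are hypothetical names for this example. It does no escaping or validation, and the servlet API has its own Cookie class for real work.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: CookieHeaders is a hypothetical helper, not a real API.
class CookieHeaders {
    // Server side: the header line that asks the browser to store a cookie,
    // restricted to requests under the given path.
    static String setCookie(String name, String value, String path) {
        return "Set-Cookie: " + name + "=" + value + "; Path=" + path;
    }

    // Server side: parse the value of a Cookie request header such as
    // "visitor=12345; theme=dark" into name/value pairs.
    static Map<String, String> parseCookieHeader(String headerValue) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String pair : headerValue.split(";")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // skip anything that is not name=value
            }
            cookies.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
        }
        return cookies;
    }

    public static void main(String[] args) {
        System.out.println(setCookie("visitor", "12345", "/notablog"));
        System.out.println(parseCookieHeader("visitor=12345; theme=dark"));
    }
}
```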

The cookie can only be included in requests to the same domain, to safeguard the user's privacy, but there are two important gotchas that people miss: image tags and subdomains.

Image tags in a page can point at entirely different sites. This is how nefarious types like doubleclick keep track of what sites you visit and build up a dossier of your interests. They have banner ad image tags on sites all over the place that point back to doubleclick.com. When your browser requests the banner image file directly from doubleclick, they can set a cookie. When an entirely unrelated site has another doubleclick ad banner, your browser happily includes the doubleclick.com cookie along with the request for that banner image file.

Typically this sort of slimy trick is combined with GET-style parameters in the URL in the image tag's source attribute. The GET parameter identifies what site the user is viewing, the cookie groups that together with the other sites the user viewed, and now Doubleclick knows way more about what websites you like to spend your time at.

Subdomains are much less often used, but can be useful if you're working on a complex web application for a company that has more than one site with the same root domain (e.g. they have bar.foo.com and baz.foo.com). Remember, if you set a cookie to have a domain of "foo.com", every time the browser requests something from "foo.com" it will include a copy of that cookie along with the request. You can also set the cookie with the domain ".foo.com" (note the leading period, that's important) and it will be included in requests to www.foo.com and also in requests to bar.foo.com, baz.foo.com, etc. This can be handy if you control both servers, and want a convenient way to pass data from one server to the other for some reason. It's not super-useful; most of the time, if you control both ends, you'd be better off using redirects with parameters, or some back-end channel.
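The matching rule as described above can be sketched in a few lines of Java. CookieDomainSketch and matches() are invented for this example, and the rule is deliberately simplified: real browsers apply more checks, and as far as I know newer cookie specifications treat a domain with and without the leading period the same way.

```java
import java.util.Locale;

// Illustrative sketch only: a simplified version of the rule described in the text.
class CookieDomainSketch {
    static boolean matches(String cookieDomain, String host) {
        String d = cookieDomain.toLowerCase(Locale.ROOT);
        String h = host.toLowerCase(Locale.ROOT);
        if (d.startsWith(".")) {
            // ".foo.com" covers foo.com itself and anything ending in ".foo.com".
            return h.endsWith(d) || h.equals(d.substring(1));
        }
        // Without the leading period, only the exact host matches.
        return h.equals(d);
    }

    public static void main(String[] args) {
        System.out.println(matches(".foo.com", "bar.foo.com")); // prints true
        System.out.println(matches("foo.com", "bar.foo.com"));  // prints false
    }
}
```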

Caching and Double-Submitting and Redirects

It is, strictly speaking, completely impossible to prevent the user from intentionally double-submitting, and pretty much impossible to prevent the possibility of an accidental double-submit.

My advice is: don't bother trying to stop it from happening at the client. Instead, make sure you code the server to cope, either by recognizing it as a double submit or by making double submits harmless.
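One possible way to recognize a double submit on the server (my own illustration, not something the approach above depends on) is to hand out a one-time token with each form, typically in a hidden field, and accept each token only once. SubmitTokens, issue(), and consume() are hypothetical names. A real application would also tie tokens to the user's session and expire old ones.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: one-time tokens for spotting repeated form submits.
class SubmitTokens {
    // Tokens that have been handed out but not yet used.
    private final Set<String> outstanding = ConcurrentHashMap.newKeySet();

    // Call when rendering the form; embed the result in a hidden field.
    String issue() {
        String token = UUID.randomUUID().toString();
        outstanding.add(token);
        return token;
    }

    // Call when handling the submit. Returns true the first time a token is
    // seen and false for repeats or tokens that were never issued.
    boolean consume(String token) {
        return token != null && outstanding.remove(token);
    }

    public static void main(String[] args) {
        SubmitTokens tokens = new SubmitTokens();
        String t = tokens.issue();
        System.out.println(tokens.consume(t)); // prints true
        System.out.println(tokens.consume(t)); // prints false
    }
}
```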

However, what I do want to avoid is having the browser generate lots of double-submits when the user uses the back and forward buttons. It's been a few years since I figured this out, so maybe browsers have gotten saner. The only way I've found is to have the response to the submit be a redirect. The redirect doesn't show up in the browser history, though the browser automatically follows it anyway. The browser never pops up a dialog offering to re-post the data.
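At the protocol level, "have the response to the submit be a redirect" just means answering the POST with a redirect status and a Location header instead of a page. The Java sketch below builds such a response as raw text. RedirectAfterPost and redirectResponse() are made-up names, and the 303 status is my choice for the illustration; in a servlet you would normally call HttpServletResponse.sendRedirect(), which sends a 302.

```java
// Illustrative sketch only: the raw text of a redirect sent in reply to a POST.
class RedirectAfterPost {
    static String redirectResponse(String location) {
        return "HTTP/1.1 303 See Other\r\n"
                + "Location: " + location + "\r\n" // where the browser should GET next
                + "Content-Length: 0\r\n"          // no body; the browser follows the redirect
                + "\r\n";
    }

    public static void main(String[] args) {
        // The target path is a hypothetical "thank you" page.
        System.out.print(redirectResponse("/notablog/thanks.html"));
    }
}
```

The browser then issues a plain GET for the new location, so the back and forward buttons land on a page that can be re-fetched without re-posting anything.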

Continued in Part 3