Dangerous paths - URI Design

ASP.NET is essentially an advanced request-processing framework. Naturally, the URI is the most important part of any request (or should be). URIs should be well designed, and should represent the request content accurately and succinctly.

Unfortunately, they are frequently misused, which causes browsers, users, and search engines no end of trouble.

Some misuse URIs by making them too generic; some sites have only the home page. Flash, AJAX, and frames are the biggest culprits here, as they are capable of making big changes to the current content of the page without affecting the address bar. Users of this type of site are frustrated because if they bookmark a buried page in the site, it only records the address of the home page. The back button also betrays them - it doesn't undo their actions anymore, but plops them completely off the site. Search engines dislike these sites because either (1) they can't access buried content due to its form (JavaScript or Flash) or (2) they can access it, but all keywords are diluted from the massive amount of content available on one page.

Some developers take the misuse to the opposite extreme. They feel that the address bar is the perfect place to store all variables, interface state data, and user preferences. They, too, cause problems for both users and search engines. Users bookmarking or e-mailing such links often find that they no longer work after their session has expired, or after a change was made on the site. The length and complexity of these URIs also make them hard to understand, which matters because many users depend on the address bar to tell where they are on the site. Search engines find them confusing, because they see (and rank) each URI as a separate page, and dilute the ranking accordingly.

So, you ask, what makes a good URI?

It should be as short as possible. Don't sacrifice consistency or obviousness, but be brief.
Organize and name things logically. ASP.NET isn't always helpful in keeping a clean structure, so I highly recommend that you use a URL rewriting module. URIs should be 'hackable' - see http://www.useit.com/alertbox/990321.html.
URIs should be deterministic.
- No two URIs should ever display the same page.
- The same URI should always display the same content.
The query string should only contain data that AFFECTS THE QUERY. If it doesn't describe the content, it doesn't belong.
The URI path should not rely on cryptic or numerical identifiers. If it does, it should also provide a human-readable title. It's really nice to be able to look at a URL and guess what it contains - especially when you have a long list of them. As a bonus, search engines absolutely love URIs that match keywords. Tip: Don't try to spam URLs with keywords. Density algorithms are applied here, also. As with page titles, pick exactly 1 keyword and stick with it.

Further reading (written by Tim Berners-Lee): http://www.w3.org/Provider/Style/URI.

Bad examples:

/Default.aspx?tabid=3
/Products/ShowProduct.aspx?prodid=4982
/showblog.aspx?articleid=98

Better examples:

/Default.aspx?tabid=3&title=ContactUs
/Products/ShowProduct.aspx?id=4982&product=Nokia_Wall_Adapter_12V
/showblog.aspx?articleid=98&title=Why_you_should_never_concatenate_SQL_commands

Best:

/contact/
/products/4982_Nokia_Wall_Adapter_12v
/blog/98_Why_you_should_never_concatenate_SQL_commands
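
As a rough illustration of the "Best" style, here is a minimal Python sketch that builds such a path from a numeric identifier and a human-readable title. The function name readable_path and its parameters are hypothetical, not part of ASP.NET or any rewriting module; a real site would also need to map the path back to the underlying page.

```python
import re

def readable_path(section, item_id, title):
    """Build a path such as /products/4982_Nokia_Wall_Adapter_12v."""
    # Collapse every run of characters that is not a letter or digit
    # into a single underscore, then trim underscores from both ends.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")
    return f"/{section}/{item_id}_{slug}"

print(readable_path("products", 4982, "Nokia Wall Adapter 12v"))
```

Keeping the numeric identifier in front means the server can still look the item up by ID even if the title portion is edited later.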

URIs in the HTTP protocol

Let's look at how a URI is sent to the server using HTTP.

Here is a basic GET request. The first line consists of the HTTP method, followed by a root-relative path, then the protocol version. The subsequent lines contain the header collection, in the form of simple name-value pairs. The two parts of the URI here are the path with its query (/blog?page=2) and the Host header (youngfoundations.org). We know that the scheme is probably "http", since we are communicating using the HTTP protocol. IIS tells us which port the request arrived on, so between the pieces we can reconstruct the original URI somewhat accurately.

Note: there are LOTS of schemes out there that use the HTTP protocol, like firefoxurl://, etc.

Note: The Host header is important, since some servers host dozens of domains, and it allows IIS to forward the request to the appropriate application in shared hosting situations. Multiple domains (hostnames) can also be pointed to a single application.

The path and the query are divided by the first question mark.

GET /blog?page=2 HTTP/1.1[CRLF]
Host: youngfoundations.org[CRLF]
Connection: close[CRLF]
Accept-Encoding: gzip[CRLF]
Accept: text/xml,application/xml,application/xhtml+xml,text/html; q=0.9,text/plain; q=0.8,image/png,*/*;q=0.5[CRLF]
Accept-Language: en-us,en;q=0.5[CRLF]
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7[CRLF]
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.5; .NET CLR 2.0.50727) Gecko/20070713 Firefox/2.0.0.5 Web-Sniffer/1.0.24[CRLF]
Referer: http://web-sniffer.net/[CRLF]
[CRLF]
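
The reconstruction described above can be sketched in a few lines of Python. This is only an illustration, not how IIS or ASP.NET does it: the function name reconstruct_uri is hypothetical, and the scheme and port are passed in as assumptions because the request itself does not carry them.

```python
def reconstruct_uri(raw_request, scheme="http", port=80):
    """Rebuild the requested URI from the head of a raw HTTP request."""
    # Everything before the first blank line is the request line plus headers.
    head = raw_request.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")

    # Request line: METHOD, then the request target, then the HTTP version.
    method, target, version = lines[0].split(" ")

    # Header names are case-insensitive, so store them lowercased.
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    # The path and the query are divided by the first question mark.
    path, _, query = target.partition("?")

    # Assumption: scheme and port are supplied by the server, not the request.
    authority = headers["host"]
    default_port = {"http": 80, "https": 443}.get(scheme)
    if port != default_port and ":" not in authority:
        authority = f"{authority}:{port}"

    return {"uri": f"{scheme}://{authority}{target}", "path": path, "query": query}

request = (
    "GET /blog?page=2 HTTP/1.1\r\n"
    "Host: youngfoundations.org\r\n"
    "Connection: close\r\n"
    "\r\n"
)
print(reconstruct_uri(request)["uri"])
```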

Content can accompany any request, although it usually only accompanies the POST method. The header collection is separated from the request body by the character sequence CRLFCRLF (2 newlines). The content in the request body is described by the content-type and content-length HTTP headers.

POST /blog HTTP/1.1[CRLF]
Host: youngfoundations.org[CRLF]
Connection: close[CRLF]
Accept-Encoding: gzip[CRLF]
Accept: text/xml,application/xml,application/xhtml+xml,text/html; q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5[CRLF]
Accept-Language: en-us,en;q=0.5[CRLF]
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7[CRLF]
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.5; .NET CLR 2.0.50727) Gecko/20070713 Firefox/2.0.0.5 Web-Sniffer/1.0.24[CRLF]
Referer: http://web-sniffer.net/[CRLF]
Content-type: text/html; charset=utf-8 [CRLF]
Content-length: 19[CRLF]
[CRLF]
Sample content body
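
To make the header/body split concrete, here is a small Python sketch that separates the two at the first CRLFCRLF and uses the Content-length header to decide how many bytes of body to keep. The function name split_request is hypothetical, and the sketch assumes the whole request is already in memory, which a real server cannot assume.

```python
def split_request(raw):
    """Split a raw HTTP request (bytes) into request line, headers, and body."""
    # The header collection ends at the first blank line (CRLF CRLF).
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    # Content-length says how many bytes of body follow the blank line.
    length = int(headers.get("content-length", 0))
    return lines[0], headers, rest[:length]

raw = (
    b"POST /blog HTTP/1.1\r\n"
    b"Host: youngfoundations.org\r\n"
    b"Content-type: text/html; charset=utf-8\r\n"
    b"Content-length: 19\r\n"
    b"\r\n"
    b"Sample content body"
)
request_line, headers, body = split_request(raw)
print(request_line, len(body))
```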

The HTTP response generated by your ASP.NET application looks slightly different than the request that prompted it. The general format remains, but the first line is now [HTTP Version] [Status-code] [Status code description]. HTTP status codes are very important, but are beyond the scope of this article. See http://en.wikipedia.org/wiki/List_of_HTTP_status_codes for more information.

HTTP/1.1 301 Moved Permanently [CRLF]
Connection: close [CRLF]
Date:Fri, 03 Aug 2007 00:36:57 GMT [CRLF]
Server:Microsoft-IIS/6.0 [CRLF]
X-Powered-By:ASP.NET [CRLF]
Location:http://www.microsoft.com [CRLF]
Content-Length:31 [CRLF]
Content-Type:text/html [CRLF]
Set-Cookie:ASPSESSIONIDSSSBDQAT=PIJAGJDBFFLFAALAJDCGBAMI; path=/ [CRLF]
Cache-control:private [CRLF]
[CRLF]
[Content-body]
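
The three-part status line is easy to take apart. The following Python sketch (the function name parse_status_line is hypothetical) splits it into version, numeric status code, and description; it assumes a well-formed line and does no validation.

```python
def parse_status_line(line):
    """Split 'HTTP/1.1 301 Moved Permanently' into its three parts."""
    # Split on the first two spaces only: the description may contain spaces.
    parts = line.strip().split(" ", 2)
    version, code = parts[0], int(parts[1])
    reason = parts[2] if len(parts) > 2 else ""
    return version, code, reason

print(parse_status_line("HTTP/1.1 301 Moved Permanently"))
```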

Important note: If you have multiple domains pointing to one website, make sure they are all 301 redirected to precisely one host name. Otherwise you will sabotage your search engine placement by (1) diluting your page rank, and (2) being penalized for duplicate content.
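
The decision behind such a redirect can be sketched as follows, in Python rather than ASP.NET for brevity. CANONICAL_HOST and canonical_redirect are hypothetical names chosen for this example; in practice the rule would usually live in a URL rewriting module or in the server configuration.

```python
# Assumption: this is the one host name the site should answer to.
CANONICAL_HOST = "www.mysite.com"

def canonical_redirect(host_header, target, scheme="http"):
    """Return (301, location) if the request used a non-canonical host, else None."""
    # Drop any port and compare case-insensitively; host names ignore case.
    host = host_header.split(":", 1)[0].lower()
    if host == CANONICAL_HOST:
        return None
    # Keep the original path and query so deep links survive the redirect.
    return 301, f"{scheme}://{CANONICAL_HOST}{target}"

print(canonical_redirect("mysite.com", "/blog?page=2"))
```

Preserving the path and query in the Location value matters: redirecting every alias to the home page would break bookmarks and inbound links.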

URIs versus URLs

The term URL (Uniform Resource Locator) has been considered obsolete for a long time. In its place stands the URI (Uniform Resource Identifier). Strictly speaking, a URL must provide all of the information required to locate and retrieve a resource, while a URI is only required to identify it in relation to the current context. Thus, a URL is a URI that "in addition to identifying a resource, [provides] a means of locating the resource by describing its primary access mechanism (e.g., its network 'location')." In common usage, the two terms are synonymous. It is still important, though, to differentiate 'complete' URIs (such as URLs) from incomplete, or relative, URIs.

For example, the following URI is also a URL:

http://www.mysite.com:54321/folder/virtualfolder/default.aspx?param1=thisisatest&param2=test2

However, these are not:

../css/shared.css [URI relative to the location of the parent document]

/images/banner.jpg [URI relative to the current network location (usually termed 'absolute')]

Logo.gif [URI relative to the location of the parent document.]

#requirements [URI fragment relative to current document. Fragments describe a section, place, or entity in the current document. In HTML, they usually refer to a certain anchor tag (by name or ID). The window is usually scrolled to the location of the anchor tag. Fragments are never sent to the server computer, and only function as a display instruction to the client. If a fragment isn't understood, it is ignored. Fragments are pretty much free-form.]

If the current document is http://mysite.com/home.html and a link to http://mysite.com/home.html#part3 is clicked, the browser (or user-agent) is not supposed to ask the server for http://mysite.com/home.html again, but older clients may. Relative fragments like #part3 are handled better.
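
To see how a client turns the relative URIs above into complete ones, here is a short Python sketch using the standard library's urllib.parse. The base address is a made-up location for the parent document, chosen only so that the "../" example has a folder to climb out of.

```python
from urllib.parse import urljoin, urldefrag

# Assumption: a hypothetical address for the parent document.
base = "http://mysite.com/docs/guide/home.html"

# Each relative URI is resolved against the parent document's location.
for relative in ["../css/shared.css", "/images/banner.jpg", "Logo.gif", "#requirements"]:
    print(relative, "->", urljoin(base, relative))

# The fragment is split off by the client and is not sent to the server.
url_without_fragment, fragment = urldefrag(urljoin(base, "#requirements"))
print(url_without_fragment, fragment)
```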

Now let us dissect the following URL:

http://www.mysite.com:54321/folder/virtualfolder/default.aspx?param1=thisisatest&param2=test2

http The scheme (protocol). The protocol determines how the client should talk to the server (basically the language, or grammar).

www.mysite.com The computer the resource is located on (DNS, WINS, or IP Address)

:54321 The port number to communicate with on the computer.

Ports spare the server from having to inspect incoming packets to work out which application they belong to: each application listens on its own port. Certain default ports are assumed for some protocols. HTTP requests are sent to port 80 by default, HTTPS requests are sent to port 443, and FTP requests are sent to port 21. If no application is listening on that port (or the request packets are blocked by a firewall), no response will be given.

Additional sorting is sometimes performed, as in the case of WCF (.NET 3.0) port sharing, or when multiple sites are hosted on a single server. When an HTTP request is sent to a server, it is accompanied by the original hostname from the address bar. An unlimited number of DNS (Domain Name System) addresses can point to a single computer, which is convenient for web hosting providers. IIS (Internet Information Services) can be configured to look at this host header, and forward the request to whichever site is configured to receive requests for that particular hostname (DNS address).
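
The scheme, host, and port pieces described above can be pulled apart programmatically. This Python sketch uses the standard library's urllib.parse; the dissect function and the DEFAULT_PORTS table are illustrative names, and the table lists only the three defaults mentioned in this article.

```python
from urllib.parse import urlsplit, parse_qs

# Default ports for the protocols mentioned in the text.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def dissect(url):
    """Break a URL into scheme, host, port, path, and query parameters."""
    parts = urlsplit(url)
    # When the URL names no port, fall back to the scheme's default.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": port,
        "path": parts.path,
        "query": parse_qs(parts.query),
    }

url = "http://www.mysite.com:54321/folder/virtualfolder/default.aspx?param1=thisisatest&param2=test2"
for name, value in dissect(url).items():
    print(name, "=", value)
```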

For information about DNS, read http://en.wikipedia.org/wiki/Domain_name_system.

Super-simplified view of DNS

DNS addresses are hierarchical, and levels (domains) are separated by a period. Read left to right, domains progress from most specific to least specific.

For example, in resolving www.mysite.com, the following steps would be taken:

Ask computer 'COM' where computer 'MYSITE' is at (what its IP address is).

Ask computer MYSITE where computer 'WWW' is at.
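
The order of those questions can be modeled with a few lines of Python. This sketch performs no network lookups and ignores root servers and caching; it only walks the labels of a host name from least specific to most specific. The function name resolution_order is hypothetical.

```python
def resolution_order(hostname):
    """List (who to ask, what to ask about) pairs, least specific domain first."""
    labels = hostname.rstrip(".").split(".")
    steps = []
    # Walk right to left: each parent domain is asked about the label to its left.
    for i in range(len(labels) - 1, 0, -1):
        parent = ".".join(labels[i:])
        steps.append((parent, labels[i - 1]))
    return steps

for parent, child in resolution_order("www.mysite.com"):
    print(f"Ask '{parent}' where '{child}' is")
```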

DNS is used for a whole lot more than just web browsing, so the company at mysite.com might have a whole bunch of computers, such as ftp.mysite.com, mail.mysite.com, pop.mysite.com, telnet.mysite.com, as well as www.mysite.com. WWW usually points to the web server for the company. Please note, however, that the WWW part is completely unnecessary, and is just a commonly followed convention.

Note: In www.mysite.com, "com" is a TLD (Top-level domain), and "mysite" is an SLD (Second-level domain).

SLDs usually cost a registration fee, as the poor owner of the "COM" computer has tremendous bandwidth bills. Third-level domains can be freely created if the parent SLD is under your control.

Paths

In the URI http://microsoft.com/default.aspx?tabid=2, the path is the portion that begins at the third forward slash and ends just before the first question mark: /default.aspx.
