FileSet Mode: httpFilesetFetch (and httpDirParse)
The primary purpose of HTTP Client support is to provide a way to move remote files through firewalls. Typically, a firewall excludes all traffic to and from a server, except for port 80. All port 80 traffic is handled by a web server, so every conversation on that port must use the HTTP protocol. Although the system can open a connection to a remote host on port 80, the existing FTP protocol cannot carry the conversation, because the web server requires HTTP. The solution is HTTP Client.
HTTP Client is modeled after the existing FileSet FTP drivers and functions in the same basic way. Most of the differences are at the protocol level. FileSet FTP sends and receives messages remotely by transferring them as files, either several messages within a single file or one file for each message. HTTP functions similarly, using the GET or POST commands to retrieve documents and the PUT command to send documents.
In FileSet FTP, an outbound transfer is straightforward: the messages are sent as files using the FTP protocol. Inbound is more complicated. A directory is given, and some subset of the files it contains is retrieved by the FTP protocol. To retrieve the files, the driver must first read the contents of the directory and form a list of available files. This is where the similarity between FTP and HTTP ends. HTTP has no protocol-level directive for listing directory contents. The most likely solution is to request the directory path as a URL, followed by a slash to indicate that it is a directory. HTTP servers typically respond to such requests with a page listing the files and subdirectories contained in the directory. This result is returned as formatted HTML. An HTML parser is required to turn this HTML content into a concise directory list that a system driver can use.
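The directory request described above can be sketched as follows. This is only an illustration in Python, not the driver's own code: the function names directory_url and fetch_listing, the timeout value, and the latin-1 fallback encoding are all assumptions made for the example.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def directory_url(base_url: str, directory: str) -> str:
    """Build the URL for a directory request, always ending in a slash."""
    url = urljoin(base_url, directory)
    if not url.endswith("/"):
        # The trailing slash tells the web server this is a directory.
        url += "/"
    return url


def fetch_listing(base_url: str, directory: str, timeout: float = 10.0) -> str:
    """Issue a GET for the directory and return the HTML listing page."""
    with urlopen(directory_url(base_url, directory), timeout=timeout) as response:
        # Fall back to latin-1 (an assumption) when no charset is declared.
        charset = response.headers.get_content_charset() or "latin-1"
        return response.read().decode(charset, errors="replace")
```

Whether the server returns a listing at all depends on its configuration; a server with directory indexing disabled may return an error page or a default document instead.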
This example shows a typical HTTP server response to a directory contents request:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<TITLE>Index of /Movies/backup/dir1</TITLE>
</HEAD>
<BODY>
<H1>Index of /Movies/backup/dir1</H1>
<PRE><IMG SRC="/icons/blank.gif" ALT=" "> <A HREF="?N=D">Name</A>
<A HREF="?M=A">Last modified</A> <A HREF="?S=A">Size</A> <A HREF="?D=A">Description</A>
<HR>
<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="/Movies/backup/">Parent Directory</A> 26-Feb-2010 14:44
<IMG SRC="/icons/text.gif" ALT="[TXT]"> <A HREF="file1.html">file1.html</A> 26-Feb-2010 14:44 Ok
<IMG SRC="/icons/text.gif" ALT="[TXT]"> <A HREF="file2.html">file2.html</A> 26-Feb-2010 14:45 Ok
<IMG SRC="/icons/text.gif" ALT="[TXT]"> <A HREF="file3.html">file3.html</A> 26-Feb-2010 14:45 Ok
</PRE><HR>
<ADDRESS>Apache/1.3.12 Server at servername.domainname.com Port 80</ADDRESS></BODY></HTML>
In this example, the requested directory contains these files:
- file1.html
- file2.html
- file3.html
Parsing out the file names requires a pseudo-parse of the HTML: the text is scanned for A HREF tags, and the quoted strings enclosed in those tags are extracted. Exceptions are made for quoted strings starting with a question mark (?). This ensures that references such as "?N=D", which is a hyperlink to reorder the list by name, are not included. Another exception is made to omit the parent directory (A HREF="/Movies/backup/").
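The scan described above might look like the following sketch. It is written in Python purely for illustration and is not the httpDirParse procedure itself; the name parse_directory_listing, the regular expression, and the rule for recognizing the parent link (comparing the reference with the parent path of the requested directory) are assumptions.

```python
import posixpath
import re

# Matches <A HREF="..."> in any letter case and captures the quoted string.
HREF_PATTERN = re.compile(r'<A\s+HREF\s*=\s*"([^"]*)"', re.IGNORECASE)


def parse_directory_listing(html: str, directory: str) -> list[str]:
    """Return the entries referenced by A HREF tags in a listing page."""
    # Path of the parent directory, e.g. "/Movies/backup/" for
    # "/Movies/backup/dir1". Assumed to be how the parent link appears.
    parent = posixpath.dirname(directory.rstrip("/"))
    if not parent.endswith("/"):
        parent += "/"

    entries = []
    for href in HREF_PATTERN.findall(html):
        if href.startswith("?"):
            continue  # column-sort links such as ?N=D
        if href == parent:
            continue  # link back to the parent directory
        entries.append(href)
    return entries
```

Applied to a listing like the one shown earlier, with /Movies/backup/dir1 as the requested directory, this yields file1.html, file2.html, and file3.html. A regular expression is used here because the text calls for a pseudo-parse rather than a full HTML parse.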
After the directory list is parsed from the contents page, it is passed to the user, where it is acted upon by custom Tcl procedures, as in FileSet FTP. It is the responsibility of the user to differentiate files from subdirectories. Subdirectories are left in the list so that the user is aware of them, in case this information is of use.
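One way a user procedure could separate the two kinds of entries is sketched below, again in Python for illustration only. It assumes that the server writes subdirectory references with a trailing slash, which is common in Apache-style listings but is not guaranteed; the name split_entries is hypothetical.

```python
def split_entries(entries: list[str]) -> tuple[list[str], list[str]]:
    """Split listing entries into (files, subdirectories)."""
    files, subdirectories = [], []
    for entry in entries:
        if entry.endswith("/"):
            # Assumption: a trailing slash marks a subdirectory.
            subdirectories.append(entry.rstrip("/"))
        else:
            files.append(entry)
    return files, subdirectories
```

If a server does not mark subdirectories this way, another signal from the listing page, such as the icon or ALT text on the same line, would have to be used instead.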
Not all web servers present these directory listings in a consistent way. Apache servers are fairly consistent, but other web servers occasionally generate slight variations. Therefore, directory-list parsing is performed by a standard Tcl procedure, httpDirParse, which can be modified as required to accommodate variations in directory-listing formats.