CGI programs for processing web forms can be written in any language, even shell scripts. The hardest part is parsing the input from the form.
That input is delivered using one of two methods, GET or
POST. The GET method appends the information
to the end of the URL, and the web server makes it available to the
CGI program in the environment variable QUERY_STRING. The
POST method places it on the program's standard input.
Whichever method is used, the input is in the form
name=value, with multiple name-value pairs being
separated by ampersands, e.g.:
user=john&firstname=John&lastname=Doe.
To ensure safe transmission across 7-bit networks, and to
prevent ambiguity, all 8-bit characters and most
non-alphanumeric characters are converted to a 2-digit hex code
with a leading percent sign. For example, all dollar
signs, $, are converted to
%24. Spaces are converted to plus signs. A CGI program
must split the string using the ampersand as the delimiter, convert
plus signs to spaces and hex codes to ASCII characters, as well as
separate the values from the names.
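To make the encoding concrete, here is a minimal sketch of an encoder for plain ASCII input. The function name urlencode and the choice of which characters are left unencoded are assumptions made for illustration; in practice the browser does this work before the CGI program ever sees the data.

```bash
urlencode() #@ USAGE: urlencode string   (hypothetical helper, ASCII input only)
{
    local LC_ALL=C s=$1 out= c i
    for (( i = 0; i < ${#s}; i++ )); do
        c=${s:i:1}
        case $c in
            [a-zA-Z0-9._-]) out+=$c ;;      # safe characters pass through
            ' ') out+='+' ;;                # spaces become plus signs
            *) printf -v c '%%%02X' "'$c"   # everything else becomes %XX
               out+=$c ;;
        esac
    done
    printf '%s\n' "$out"
}

urlencode 'cost $5 & up'    # prints cost+%245+%26+up
```

Decoding, which is what the CGI program has to do, is the same mapping in reverse.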
When I wrote my first CGI scripts, more than 10 years ago, I
used some CGI utilities, written in C, to do the parsing and hex
conversion. Since then, I have found ways of doing it in the
shell itself. My latest version, which I wrote recently for my
revised Word Finder and Anagram Solver, is not as portable as my
previous scripts (it requires bash version 3.1 or
later), but it is compact and flexible.
It is a single function, parse_query, that takes a
list of variable names as its arguments. Values will only be
assigned to the variables in the list. Since anyone can bypass a
web form and submit any name-value pairs they like to the
script, this ensures that malicious variable names have no
effect.
Here is the function without interruption; a trial run and a detailed, line-by-line explanation follow.
parse_query() #@ USAGE: parse_query var ...
{
    local var val
    local IFS='&'
    vars="&$*&"
    [ "$REQUEST_METHOD" = "POST" ] && read QUERY_STRING
    set -f
    for item in $QUERY_STRING; do
        var=${item%%=*}
        val=${item#*=}
        val=${val//+/ }
        case $vars in
            *"&$var&"* )
                case $val in
                    *%[0-9a-fA-F][0-9a-fA-F]*)
                        ## Next line requires bash 3.1 or later
                        printf -v val "%b" "${val//\%/\\x}."
                        ## Older bash, use: val=$( printf "%b" "${val//\%/\\x}." )
                        val=${val%.}
                esac
                eval "$var=\$val"
                ;;
        esac
    done
    set +f
}
Trial Run
Put the parse_query function and the following
lines into a file and execute it:
unset REQUEST_METHOD ## just in case
QUERY_STRING="name=Jane%20Doe&shell=bash&quote=%22"
parse_query name shell quote
printf "%s\n" "name=$name" "shell=$shell" "quote=$quote"
echo
unset name shell quote
parse_query name quote
printf "%s\n" "name=$name" "shell=$shell" "quote=$quote"
The result should be:
name=Jane Doe
shell=bash
quote="

name=Jane Doe
shell=
quote="
How it works
I always use the POSIX form of function definition, rather than the
ksh variety, and I always (well, almost always) add a usage
comment marked with #@ to aid grepping for
documentation. My good intentions to comment all my code often get
no further than that:
parse_query() #@ USAGE: parse_query var [var ...]
The variables var and val are used to hold
the name of the variable and its value respectively. They are made
local to the function and cannot be used as names in the web form:
local var val
The Internal Field Separator is set to an ampersand so that the
QUERY_STRING is broken up automatically:
local IFS='&'
When $* is used inside double quotes, the arguments are
separated by the first character of IFS;
vars will contain all the allowable variables, and each
will be preceded and followed by an ampersand:
vars="&$*&"
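The joining behaviour of "$*" is easy to check in isolation. The following is a small sketch; the function name show_vars is hypothetical and exists only for this demonstration.

```bash
show_vars() #@ USAGE: show_vars var ...   (hypothetical name, for illustration)
{
    local IFS='&'
    # "$*" joins the arguments with the first character of IFS
    printf '%s\n' "&$*&"
}

show_vars name shell quote    # prints &name&shell&quote&
```

Because every name ends up with an ampersand on each side, a later pattern match can look for one exact name without being fooled by a partial match.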
If the method used by the form is GET,
QUERY_STRING will already contain the form information.
If it uses POST, this line reads it into
QUERY_STRING:
[ "$REQUEST_METHOD" = "POST" ] && read QUERY_STRING
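One way to try this out without a web server is to supply the form data on standard input from a here-document. This is only a sketch: the helper name get_form_data is hypothetical, and the -r option to read is an addition not in the original line (it stops read from treating backslashes specially).

```bash
get_form_data() #@ hypothetical helper, not part of parse_query
{
    # -r is an addition here: backslashes in the input are kept as they are
    [ "$REQUEST_METHOD" = "POST" ] && read -r QUERY_STRING
    return 0
}

# Simulate a POST request: the form data arrives on standard input
REQUEST_METHOD=POST
get_form_data <<EOF
name=Jane+Doe&shell=bash
EOF
post_data=$QUERY_STRING

# With GET, QUERY_STRING is already set and standard input is not read
REQUEST_METHOD=GET
QUERY_STRING='name=John'
get_form_data </dev/null
get_data=$QUERY_STRING

printf '%s\n' "POST: $post_data" "GET: $get_data"
```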
File globbing is normally performed when variables are expanded;
set -f turns it off. This shouldn't be necessary, since
any wildcard characters in QUERY_STRING will be in hex
format, but it doesn't hurt and is a good habit to get into:
set -f
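The effect of set -f can be seen by expanding the same unquoted variable with globbing on and then off. This sketch assumes mktemp -d is available; the scratch directory and file names are invented for the demonstration.

```bash
# Hypothetical demo: a scratch directory containing two matching files
tmpdir=$(mktemp -d) || exit 1
touch "$tmpdir/a.txt" "$tmpdir/b.txt"
pattern="$tmpdir/*.txt"

set -- $pattern          # unquoted expansion: the wildcard is expanded
with_glob=$#             # one word per matching file

set -f
set -- $pattern          # same expansion with globbing turned off
without_glob=$#          # the pattern itself, as a single word
set +f

rm -r "$tmpdir"
printf '%s\n' "globbing on: $with_glob word(s)" "globbing off: $without_glob word(s)"
```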
Since IFS is set to an ampersand,
QUERY_STRING will be broken up into its constituent
name=value segments, and a for loop
processes each one in turn:
for item in $QUERY_STRING; do
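To see the splitting on its own, the loop can simply print each segment. The sketch below reuses the sample string from the start of the article; the function name list_pairs is hypothetical.

```bash
# Sample input taken from the example earlier in the article
QUERY_STRING='user=john&firstname=John&lastname=Doe'

list_pairs() #@ hypothetical helper, for illustration only
{
    local IFS='&' item
    set -f                          # no filename expansion on the unquoted variable
    for item in $QUERY_STRING; do
        printf '%s\n' "$item"       # one name=value segment per line
    done
    set +f
}

list_pairs
```

This prints user=john, firstname=John and lastname=Doe on separate lines.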
The name of the variable is everything before the first equals sign:
var=${item%%=*}
Its value is everything after the first equals sign:
val=${item#*=}
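The choice of %% for the name and a single # for the value matters when the value itself contains an equals sign. A short sketch, with sample segments invented for the purpose:

```bash
item='expr=a=b'
var=${item%%=*}     # longest match of "=*" removed from the end
val=${item#*=}      # shortest match of "*=" removed from the start
printf '%s\n' "var=$var" "val=$val"    # prints var=expr and val=a=b

# A segment with no equals sign leaves both expansions unchanged
item='flag'
novar=${item%%=*}
noval=${item#*=}
printf '%s\n' "novar=$novar" "noval=$noval"
```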
Plus signs are converted to spaces using the non-standard parameter
expansion found in bash:
val=${val//+/ }
The case statement checks that $var is found between
ampersands in the $vars variable:
case $vars in
*"&$var&"* )
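The same test can be pulled out into a small function to show why the surrounding ampersands matter. This is a sketch only; is_allowed is a hypothetical helper and is not used by parse_query.

```bash
is_allowed() #@ USAGE: is_allowed name var ...   (hypothetical helper)
{
    local name=$1 IFS='&'
    shift
    case "&$*&" in
        *"&$name&"*) return 0 ;;    # exact name found between ampersands
    esac
    return 1
}

is_allowed shell name shell quote && echo "shell: accepted"
is_allowed she   name shell quote || echo "she: rejected"
is_allowed PATH  name shell quote || echo "PATH: rejected"
```

A partial name such as she is rejected, and so is any name that was not listed, which is what keeps a submitted PATH from overwriting the real one.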
If it is, the nested case statement checks whether the
value contains a hex code:
case $val in
*%[0-9a-fA-F][0-9a-fA-F]*)
Here is where the conversion is done, shown in the command-substitution form that also works with bash older than 3.1 (the full listing uses printf -v instead). There are four steps involved in this line:
- percent signs are converted to backslash-x (\x) using the parameter expansion mentioned above;
- the built-in command printf's %b specifier converts the resulting hex code to the ASCII character;
- a period is added to prevent command substitution stripping any trailing newlines; and
- the result of printf via command substitution is assigned to val:
val=$( printf "%b" "${val//\%/\\x}." )
The superfluous period is removed:
val=${val%.}
esac
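The purpose of the extra period is easiest to see with a value that ends in an encoded newline. The sample value below is invented for this sketch.

```bash
val='Jane%20Doe%0A'      # sample value ending in an encoded newline

# Without the period, command substitution drops the trailing newline
plain=$( printf "%b" "${val//\%/\\x}" )

# With the period, the newline survives; the period is then removed
kept=$( printf "%b" "${val//\%/\\x}." )
kept=${kept%.}

printf '%s\n' "without period: ${#plain} characters" "with period: ${#kept} characters"
```

The first result is 8 characters long and the second is 9, the difference being the newline that would otherwise have been lost.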
Finally, the value is assigned to the variable:
eval "$var=\$val"
;;
esac
done
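The backslash before $val in the eval line is what keeps the value from being parsed a second time. The sketch below uses an invented, hostile-looking value to show that it is stored literally; it also shows printf -v as a possible alternative in bash 3.1 or later, which is a suggestion and not what the function uses. Either way, the variable name itself must be trusted, which is the job of the list of allowed names.

```bash
val='$(echo injected) "quoted" `date`'    # hypothetical hostile-looking value

# eval sees the string: greeting=$val, so the value is never re-parsed
var=greeting
eval "$var=\$val"
printf '%s\n' "$greeting"

# Possible alternative in bash 3.1 or later, avoiding eval altogether
var2=greeting2
printf -v "$var2" '%s' "$val"
printf '%s\n' "$greeting2"
```

Both printf commands print the value exactly as it was written, with no command having been run.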
Purists will say that the next line is wrong; rather than turning
globbing back on, the function should have saved the state of the
options (opts=$-) before turning it off, and turn it on
only if it was on when the function was entered (case $opts in
*f*) ;; *) set +f;; esac). They'd be right, but as I never
turn off globbing except for specific commands, I can get away with
it.
set +f
}
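For anyone who wants the purists' version, the save-and-restore pattern described above might look like the following sketch. The function name noglob_demo is hypothetical, and the body is reduced to the option handling alone.

```bash
noglob_demo() #@ hypothetical wrapper showing the save-and-restore pattern
{
    local opts=$-            # option flags in effect on entry
    set -f
    # ... work that relies on globbing being off would go here ...
    case $opts in
        *f*) ;;              # globbing was already off: leave it off
        *) set +f ;;         # globbing was on: turn it back on
    esac
}

set -f; noglob_demo
case $- in *f*) after_off="still off" ;; *) after_off="turned on" ;; esac

set +f; noglob_demo
case $- in *f*) after_on="turned off" ;; *) after_on="still on" ;; esac

printf '%s\n' "entered with globbing off: $after_off" "entered with globbing on: $after_on"
```

In both cases the caller gets back the globbing state it started with.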