Redemption Steps | Writing Secure Code

The obvious thing to do is to never invoke a command interpreter of any sort. But that isn't always practical, especially when using a database. Similarly, it would be just about as useful to say that if you do have to use a command shell, don't use any external data in it. That just isn't practical advice in most cases.

The only worthwhile answer is to do validation. The road to redemption is quite straightforward here:

Check the data to make sure it is okay.
Take an appropriate action when the data is invalid.

Data Validation

At the highest level, you have two choices. You can either validate everything you're going to ship off to the external process, or you can just validate the parts that are input from untrusted sources. Either one is fine, as long as you're thorough about it.

It's usually a good idea to validate external data right before you use it. There are a couple of reasons for this. First, it ensures that the data gets examined on every data path leading up to that use. Second, the semantics of the data are often best understood right before using the data. This allows you to be as accurate as possible with your input validation checks. It is also a good defense against the possibility of the data being modified in a bad way after the check.

Ultimately, however, a defense-in-depth strategy is best here. It's also good to check data as it comes in so that there is no risk of it being used without being checked elsewhere. Particularly if there are lots of places where the data can be abused, it might be easy to overlook a check in some places.
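A minimal sketch of that layering, assuming a hypothetical program that looks up a user-supplied host name. The function names (`is_valid_hostname`, `accept_request`, `build_lookup_command`) and the particular allow-list are illustrative assumptions, not something the text prescribes. The same check runs once when the data arrives and again right before the command is built.

```python
import string

# Hypothetical allow-list for this sketch: letters, digits, dots, hyphens.
_ALLOWED = set(string.ascii_letters + string.digits + ".-")


def is_valid_hostname(name):
    """Return True only if every character is on the allow-list."""
    return (
        0 < len(name) <= 253
        and not name.startswith("-")  # a leading '-' could be read as an option
        and all(ch in _ALLOWED for ch in name)
    )


def accept_request(raw_name):
    """First layer: check the data as it comes in."""
    if not is_valid_hostname(raw_name):
        raise ValueError("invalid character")
    return raw_name


def build_lookup_command(name):
    """Second layer: check again right before use, on every path."""
    if not is_valid_hostname(name):
        raise ValueError("invalid character")
    # Returned as an argument list rather than a single shell string.
    return ["nslookup", name]
```

If some other code path reaches `build_lookup_command` without going through `accept_request`, the second check still runs.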

There are three prominent ways to determine data validity:

The deny-list approach: You look for matches demonstrating that the data is invalid, and accept everything else as valid.
The allow-list approach: You look for the set of valid data, and reject anything else (even if there's some chance it wasn't problematic).
The quoting approach: You transform data so that there cannot be anything unsafe.
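The three approaches can be sketched side by side. This is an illustration under assumptions: the character sets and function names are made up for the example, the deny-list is left incomplete on purpose, and the quoting function relies on Python's `shlex.quote`, which targets POSIX shells only.

```python
import shlex
import string

# Deliberately incomplete, as hand-written deny-lists tend to be.
DENY = set(";|&`$<>")
# Hypothetical allow-list: letters, digits, period, underscore.
ALLOW = set(string.ascii_letters + string.digits + "._")


def deny_list_ok(data):
    """Deny-list: reject known-bad characters, accept everything else."""
    return not any(ch in DENY for ch in data)


def allow_list_ok(data):
    """Allow-list: accept known-good characters, reject everything else."""
    return data != "" and all(ch in ALLOW for ch in data)


def quote_for_posix_shell(data):
    """Quoting: transform the data so a POSIX shell sees one literal word."""
    return shlex.quote(data)
```

For an input such as `notes.txt`, a newline, and then `rm x`, the deny-list check passes because the newline was never listed, while the allow-list check rejects it.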

All of these approaches have the drawback that you might forget something important. In the case of deny-lists and quoting, this could obviously have bad security implications. In fact, it's unlikely that you'll end up with secure software using a deny-list approach if you're passing the data to some kinds of systems (such as shells), because the list of characters that can have special meaning is actually quite lengthy. For some systems, just about anything other than letters and digits can have a special meaning. Quoting is also much more difficult than one might think. For example, when writing code that performs quoting for some kinds of command processors, it's common to take a string and stick it in quotes. If you're not careful, attackers can just throw their own quotes in there. And, with some command processors, there are even metacharacters that have meaning inside a quoted string (this includes UNIX command shells).
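The quoting pitfall can be seen in a short sketch. The `naive_quote` helper and the hostile string are hypothetical. Python's `shlex.split` is used here only to model POSIX quoting rules (it does not treat `;` as a command separator), and `shlex.quote` stands in for a quoting routine that escapes embedded quotes.

```python
import shlex


def naive_quote(data):
    # Flawed on purpose: wraps the data in single quotes and nothing more.
    return "'" + data + "'"


# The attacker supplies their own quotes to close ours early.
hostile = "x'; touch pwned; '"

naive_cmd = "echo " + naive_quote(hostile)
safer_cmd = "echo " + shlex.quote(hostile)

# Split each command line the way POSIX quoting rules would.
naive_words = shlex.split(naive_cmd)
safer_words = shlex.split(safer_cmd)

print(len(naive_words), len(safer_words))
```

With naive quoting, the attacker's text escapes the quotes and `touch` shows up as a separate word; with `shlex.quote`, the whole hostile string stays a single argument. Note also that in POSIX shells, `$` and the backquote are still expanded inside double quotes, so wrapping data in double quotes has the same kind of gap.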

To give you a sense of how difficult it can be, try to write down every UNIX shell metacharacter on your own. Include everything that may be taken as control, instead of data. How big is your list?

Our list includes every piece of punctuation except @, _, +, :, and the comma. And we're not sure that those characters are universally safe. There might be shells where they're not.

You may think you have some other characters that can never be interpreted with special meaning. A minus sign? That might be interpreted as signaling the start of a command-line option if it's at the start of a word. How about the caret (^)? Did you know it does substitution? How about the % sign? While it might often be harmless when interpreted as a metacharacter, it is a metacharacter in some circumstances, because it does job control. The tilde (~) is similar in that it will, in some scenarios, expand to the home directory of a user if it's at the start of a word, but otherwise it will not be considered a metacharacter. That could be an information leak or worse, particularly if it is a vector for seeing a part of the file system that the program shouldn't be able to see. For example, you might stick your program in /home/blah/application, and then disallow double dots in the string. But the user might be able to access anything in /home/blah just by prefixing with ~blah.
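A small sketch of checks for these position-dependent characters, assuming a hypothetical helper named `word_start_hazards`. It only covers the minus sign, tilde, and double-dot cases described above and is not a complete validator.

```python
def word_start_hazards(arg):
    """Return a list of reasons the argument is risky as a shell word."""
    problems = []
    if arg.startswith("-"):
        problems.append("leading '-' may be parsed as an option")
    if arg.startswith("~"):
        problems.append("leading '~' may expand to a home directory")
    if ".." in arg:
        problems.append("'..' can climb out of the directory")
    return problems


def dotdot_only_check(arg):
    """The insufficient check from the example: only disallow double dots."""
    return ".." not in arg
```

A value such as `~blah/secret` passes `dotdot_only_check` but is flagged by `word_start_hazards`, which is the gap the /home/blah example describes.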

Even spaces can be control characters, because they are used to separate arguments or commands. There are many types of whitespace with this behavior, including tabs, newlines, carriage returns, form feeds, and vertical tabs.

Plus, there can be control characters like CTRL-D and the NUL character that can have undesirable effects.
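One way to surface these characters is sketched below. The function name is hypothetical, and the check leans on Python's `str.isspace` and the Unicode "Cc" (control) category, which is one reasonable definition among several.

```python
import unicodedata


def separators_and_controls(data):
    """Return (index, character) pairs for whitespace and control chars."""
    found = []
    for i, ch in enumerate(data):
        # isspace() covers space, tab, newline, CR, form feed, vertical tab;
        # category "Cc" covers control characters such as NUL and CTRL-D.
        if ch.isspace() or unicodedata.category(ch) == "Cc":
            found.append((i, ch))
    return found
```

An allow-list that admits only letters and digits rejects all of these automatically; a scan like this is mainly useful when deciding whether an allow-list can safely be widened.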

All in all, it's much easier to use an allow-list. If you're going to use a deny-list, you'd better be incredibly sure you're covering all your bases. But allow-lists alone may not be enough. Education is definitely necessary, because even if you're using an allow-list, you might allow spaces or tildes without realizing what might happen in your program from a security perspective.

Another issue with allow-lists is that you might have unhappy users because inputs that should be allowed aren't. For example, you might not allow a + in an e-mail address, but find people who like to use them to differentiate who they're giving their e-mail address to. Still, the allow-list approach is strongly preferable to the other two approaches.

Consider the case where you take a value from the user that you'll treat as a filename. Let's say you do validation as such (this example is in Python):

    import string

    for char in filename:
        # Reject anything that is not a letter, a digit, or a period.
        if (char not in string.ascii_letters and char not in string.digits
                and char != '.'):
            raise ValueError("InputValidationError")

This allows periods so that the user can type in files with extensions, but forgets about the underscore, which is common. But, with a deny-list approach, you might not have thought to disallow the slash, which would be bad; an attacker could use it plus the dots to access files elsewhere on the filesystem, beyond the current directory. With a quoting approach, you would have had to write a much more complex parsing routine.

It's common to use regular expressions to perform this kind of test. Regular expressions are easy to get wrong, however, especially when they become complex. If you want to handle nested constructs and such, forget about it.

Generally, from a security view, it's better to be safe than sorry. Using regular expressions can lead to easy rather than safe practices, particularly when the most precise checks would require more complex semantic checking than a simple pattern match.
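As one illustration of how a simple-looking pattern can go wrong, consider a regular-expression version of the filename check. The pattern and function names are assumptions made for this sketch. In Python's `re` module, `$` also matches just before a trailing newline, so the anchored pattern quietly accepts a filename that ends in one, whereas `fullmatch` requires the entire string to match.

```python
import re

# Looks right, but '$' also matches just before a newline at the end.
LOOSE = re.compile(r"^[A-Za-z0-9.]+$")
# Same character class, checked against the entire string instead.
STRICT = re.compile(r"[A-Za-z0-9.]+")


def loose_ok(filename):
    return LOOSE.match(filename) is not None


def strict_ok(filename):
    return STRICT.fullmatch(filename) is not None


print(loose_ok("report.txt\n"), strict_ok("report.txt\n"))  # True False
```

Both versions reject `../etc/passwd`; only the strict one rejects the trailing newline.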

When a Check Fails

There are three general strategies for dealing with a failure. They're not even mutually exclusive. It's good to always do at least the first two:

Signal an error (and, of course, refuse to run the command as-is). Be careful how you report the error, however. If you just copy the bad data back, that could become the basis for a cross-site scripting attack. You also don't want to give the attacker too much information (particularly if the check uses run-time configuration data), so sometimes it's best to simply say "invalid character" or give some other vague response.
Log the error, including all relevant data. Be careful that the logging process doesn't itself become a point of attack; some logging systems accept formatting characters, and trying to naively log some data (such as carriage returns and linefeeds) could end up corrupting the log.
Modify the data to be valid, either replacing it with default values or transforming it.

We don't generally recommend the third option. Not only can you make a mistake, but even when you don't make a mistake and the end user does, the semantics can be unexpected. It's easier to simply fail, and do so safely.
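The first two strategies might be combined along these lines. This is a sketch under assumptions: the logger name, the `InputValidationError` class, and the `reject` helper are hypothetical, and Python's standard `logging` module stands in for whatever logging system is actually in use.

```python
import logging

logger = logging.getLogger("validation")


class InputValidationError(ValueError):
    """Raised when untrusted input fails a check."""


def reject(field, bad_value):
    # The data is passed as an argument, never as the format string itself.
    # %r logs the repr, so CR/LF and other control characters appear as
    # escapes (for example \n) instead of breaking the log line.
    logger.warning("rejected %s: %r", field, bad_value)
    # Vague on purpose: the bad data is not echoed back to the caller.
    raise InputValidationError("invalid character")
```

The caller sees only the vague message, while the log keeps the full (escaped) value for later investigation.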