[Snort-devel] [hacker at ...104...: Unicode and IDS evasion]

Fyodor fygrave at ...1...
Mon Oct 30 05:27:31 EST 2000

Gee.. we definitely need to have some feedback from the hosts/systems which
we monitor if we want to stay effective with IDS. If we take
the same approach here as we took with http encoding (i.e. just
deploy a plugin to decode it), we will miss numerous UNICODE-related attacks
as well. Integration with the nessus database is one option; other thoughts?

What we could have is some sort of internal database (need to think of a
fast search method for it, keyed by IP address) to look up host-specific
information. Maybe some keyword(s) for rules (e.g. "generic") could be
introduced as well, since we will probably need to mark the rules which do
not require host-specific info...

any thoughts?

----- Forwarded message from Eric Hacker <hacker at ...104...> -----

From: Eric Hacker <hacker at ...104...>
Date:         Sun, 29 Oct 2000 23:06:39 -0500
To: FOCUS-IDS at ...84...
Subject:      Unicode and IDS evasion
X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2910.0)
Reply-To: Eric Hacker <hacker at ...104...>

Unicode and IDS evasion.

Everyone by now is aware of the IIS Unicode vulnerability [1]. Robert Graham
had mentioned in a previous post [2] that his company had a UTF8 parser
coded and released within hours of the announcement to catch this attack and
its variations.

I got to thinking about that. Wow, a UTF8 parser in a couple of hours? That
can't be right; with all the different language options and whatnot, I doubt
I'd even be able to make a list of things to check in a few hours. Bruce
Schneier has warned about the complexities of Unicode processing and what
they might mean for security [3], and now we've seen a vulnerability. To get
it right in a few hours seems very difficult.

So I took a look at the write up by Network Ice on what they were looking
for. [4] They say:

   UTF8 is a multibyte character set, which means it can
   use one, two, or more bytes to represent a single
   character. It is used to represent non-English
   characters beyond the traditional 7-bit ASCII. In
   particular, it is used for far-east characters such
   as Chinese, Japanese, and Korean. In all, UTF8 can
   represent over 30,000 different characters.

Wow. 30,000 different characters! What does IIS do with all those? I started
looking for documentation from Microsoft to answer that, but still haven't
come up with anything clear. I read on:

   By using multiple bytes to represent a traditional
   7-bit ASCII character, an intruder can evade an
   intrusion detection system (IDS) or compromise a web
   server by evading canonicalization/normalization.

Ah yes, evade an IDS. That was what I was thinking. Anyone well schooled
in Ptacek and Newsham [5], or more recently Hoglund and Gary [6], will
realize the difficulty of keeping an IDS synchronized with the server.
Unicode doesn't make that any easier. I read on:

   Canonicalization/normalization is a process whereby
   a web server strips off the "backtracking"
   subdirectory of "../". By URL encoding the backtracking
   subdirectories, an intruder could bypass this process,
   and thereby access any file on the system.  ...

   With UTF8 encoded backtracking, it might look like:
   This alert triggers when such an attempt has been made.

First, I thought %C0%AF was a '/' and %C0%AE was the '.', but I could have
crossed my test results. What really concerns me, though, is that my
understanding of English would say that the above means they only alert
on a UTF8 encoded backtracking attempt. That doesn't help much on the
evasion front. So then I thought maybe it wasn't relevant for other
characters, because there was no other way to encode them.
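For reference, %C0%AF and %C0%AE are the two-byte "overlong" UTF8 encodings of '/' (0x2F) and '.' (0x2E): the code point is spread across the two-byte form even though it fits in one byte. A strict UTF8 decoder must reject overlong forms, which is exactly why a lenient server-side parser opens an evasion gap. A small sketch (mine, not from the advisory):

```python
def overlong_2byte(ch):
    """Two-byte overlong UTF-8 form of a 7-bit ASCII character:
    110xxxxx 10xxxxxx, carrying the code point's upper 5 and
    lower 6 bits. Strict decoders reject it; lenient ones don't."""
    cp = ord(ch)
    assert cp < 0x80
    return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])

def percent_encode(raw):
    """URL %-escape every byte."""
    return "".join("%%%02X" % b for b in raw)

# percent_encode(overlong_2byte('/')) -> "%C0%AF"
# percent_encode(overlong_2byte('.')) -> "%C0%AE"
```
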

Lacking a clear table indicating how IIS interprets UTF8, I did some
testing. I ran through some potential UTF8 codes on my unpatched W2K IIS
test server. I examined the logs to determine what IIS thought the URL was.
I found thirteen representations for the letter 'a'. I tested all of these
and successfully retrieved a URL (http://myserver/a.txt) encoded with them.
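I can't reconstruct here exactly which thirteen forms the server accepted, but overlong encodings alone give several candidates: an ASCII code point like 'a' (0x61) can be padded into the 2-, 3-, and 4-byte UTF8 byte patterns, and each byte can then be sent raw or %-escaped. A sketch of the overlong generator (this assumes a fully lenient parser; it is not a statement of what IIS actually accepts):

```python
def overlong_forms(ch):
    """The 1- to 4-byte UTF-8-style byte patterns for an ASCII
    code point. All but the first are overlong: meaningful only
    to a lenient decoder, since the character fits in one byte."""
    cp = ord(ch)
    assert cp < 0x80
    return [
        bytes([cp]),                          # 0xxxxxxx
        bytes([0xC0 | (cp >> 6),              # 110xxxxx
               0x80 | (cp & 0x3F)]),          # 10xxxxxx
        bytes([0xE0,                          # 1110xxxx (all x = 0 here)
               0x80 | (cp >> 6),
               0x80 | (cp & 0x3F)]),
        bytes([0xF0,                          # 11110xxx (all x = 0 here)
               0x80,
               0x80 | (cp >> 6),
               0x80 | (cp & 0x3F)]),
    ]

# overlong_forms('a') -> [b'a', b'\xc1\xa1', b'\xe0\x81\xa1', b'\xf0\x80\x81\xa1']
```
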

That doesn't bode well for an IDS trying to monitor for hostile activity. If
a.txt was a vulnerability, would any IDS vendor have caught it?

A search of various (but not all) IDS vendors' web sites does not bring up
any declaration of full Unicode support. Other than Network ICE, I really
didn't see mention of it. It is unclear to me exactly which UTF8 characters
Network ICE detects. I feel fairly confident that UTF8 encoding of an attack
would bypass most if not all network IDSs today.

Do you want a code page with that?

I'm no IIS or W2K wizard, but from what I can tell, it seems that when W2K
is set up for different languages, the interpretation of UTF8 characters
will be different. I found this in the IIS documentation [7] regarding code
pages:

   A code page can be represented in a table as a mapping
   of characters to single-byte values or multibyte
   values. Many code pages share the ASCII character set
   for characters in the range 0x00 - 0x7F.

If one thinks synchronizing the IDS to the way various OSs' TCP/IP stacks
handle overlapping fragments is difficult, try matching the code pages the
web server is using. This is sure to be a major hassle for IDS monitoring
of non-English versions of Microsoft IIS.

A light at the end of the tunnel?

All is not lost, however. Redirecting the focus of the IDS, together with
standard web server coding practices, should alleviate much of this problem.
I call it client side normalization. I introduced it in my paper [8]; here
I will apply it to the Unicode parsing problem.

A URL may consist of a resource location and perhaps data being submitted.
The data is typically preceded by a '?'. The resource location portion of a
URL is almost always generated by the web server owner. Web browsers do not
change the form of the resource location when submitting data. In this way I
claim that web browsers represent client side normalization.

Certainly this cannot be trusted. There are many ways for one to send any
URL one wants to a web server. However, the normal URLs being submitted will
be those previously generated by the server itself and sent by normal web
browsers. If a URL is not in this set, it is probably an event of interest.

Web server owners should not create URLs that require reduction or
canonicalization. Why use a multi-byte character to represent 'a'?
Therefore, when reduction or canonicalization takes place in the resource
location, it is an event worth noting. I repeat: the presence of reduction
or canonicalization within the resource location part of a URL is itself
anomalous, likely malicious, and worthy of a NIDS alert.
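That rule can be sketched as a simple check; the function name and the exact byte tests are mine, for illustration only:

```python
def resource_needs_alert(url):
    """Alert when the resource portion of a URL (before any '?')
    contains anything a server-generated link should never need:
    percent-escapes, non-ASCII bytes, or '..' backtracking.
    Illustrative sketch of checking for the absence of client
    side normalization; not a complete or hardened detector."""
    resource = url.split("?", 1)[0]
    if "%" in resource or ".." in resource:
        return True
    return any(ord(c) > 0x7F for c in resource)

# resource_needs_alert("/scripts/..%c0%af../winnt/system32/cmd.exe") -> True
# resource_needs_alert("/index.html") -> False
# resource_needs_alert("/cgi-bin/search?q=%C0%AE") -> False (data part not checked)
```
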

It is highly likely that the same can be said for the data portion of the
URL. Client side processing can be used to validate data within an accepted
character set. Again, this does not provide assurance. It does identify data
that contains non-standard encoding as anomalous.

Thus, by looking for reduction or canonicalization itself, one can generate
alerts for likely malicious traffic and solve much of the Unicode problem.
Perhaps by using an overly inclusive character set that includes all
reduction possibilities from all languages, one can avoid the language code
page problem. Otherwise the code page problem will require extensive
configurability within the IDS to protect international, Unicode-capable
web servers.

Is it pattern matching or is it protocol analysis?

I have an interesting tidbit to add to all this. The other day I was reading
a draft white paper that was given to an acquaintance. The paper was
supposedly authored by someone at NetworkIce and was arguing that third
generation IDS that performed protocol analysis were much better than
earlier IDS that just did pattern matching. The acquaintance said, "It all
boils down to pattern matching in the end."

I read the paper. I had to agree. There was nothing in that paper that
identified any particular advantage of protocol analysis other than that it
was a better way to reduce data before pattern matching. Yes, it is a better
way to reduce data, but much of what the paper called protocol analysis is
done by Snort already.

Here, however, is a clear advantage of protocol analysis. There is no way a
standard pattern matching IDS can perform Unicode reduction. Sure, someone
could throw a protocol pre-processor into Snort to handle Unicode, but
detecting this type of attack can't be done with a reasonably sized pattern
set. I can see no way of writing a pattern matching signature that will
detect Unicode within a port 80 stream.
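For a sense of what such a pre-processor would have to do, here is a minimal sketch (my own, assuming a lenient decoder that collapses two-byte sequences to their code point): normalize the stream first, then hand the reduced bytes to the ordinary pattern matcher.

```python
def reduce_utf8(data):
    """Collapse two-byte UTF-8-style sequences (including overlong
    ones) to their code point's low byte, so that '%C0%AF'-style
    evasions normalize to plain ASCII before pattern matching.
    A sketch of a Unicode pre-processing stage, not Snort code,
    and deliberately limited to the two-byte case."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b & 0xE0 == 0xC0 and i + 1 < len(data) and data[i + 1] & 0xC0 == 0x80:
            # 110xxxxx 10xxxxxx -> code point (lenient: overlongs allowed)
            out.append((((b & 0x1F) << 6) | (data[i + 1] & 0x3F)) & 0xFF)
            i += 2
        else:
            out.append(b)
            i += 1
    return bytes(out)

# reduce_utf8(b'..\xc0\xaf..\xc0\xafwinnt') -> b'../../winnt'
```
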

I welcome any discussion on the ideas presented here. I recognize that my
limited experience does not encompass all of reality and that my judgments
may therefore be wrong. If you have evidence or ideas that suggest such,
please discuss.

Eric Hacker, GCIA, MCSE, CCSE
Lucent NPS, Security Practice

[1] http://www.securityfocus.com/bid/1806
[2] http://www.securityfocus.com/archive/96/140752
[3] http://www.counterpane.com/crypto-gram-0007.html#9
[4] http://www.networkice.com/advice/intrusions/2000639/default.htm
[5] http://www.robertgraham.com/mirror/Ptacek-Newsham-Evasion-98.html
[6] http://www.securityfocus.com/focus/ids/articles/desynch.html
[7] http://windows.microsoft.com/windows2000/en/server/iis/
[8] http://www.securityfocus.com/focus/ids/articles/resynch.html

----- End forwarded message -----

