Friday, March 26, 2010

HTTP Headers Part I

When analyzing web traffic, it is important to know what you are reading. With botnets and malware waiting almost anywhere a user goes on the internet, knowing how to read HTTP traffic may make the difference between catching an infection and turning your network over to some botnet author with acne, pale skin, and too much RBBAC (Red Bull Blood Alcohol Content).

What I plan to put here for now is just some basic information about HTTP headers. Time permitting and memory working, I am going to try to go a little deeper and cover other important aspects of this traffic, using Wireshark screenshots. I would eventually like to expand this to include other web traffic protocols such as HTTPS, SSL/TLS, DNS, etc.

For anyone who may not know:
HTTP stands for Hypertext Transfer Protocol. According to the RFC, this is an "application-level protocol for distributed, collaborative, hypermedia information systems." It has been around since at least 1990 and its current version is 1.1. HTTP 1.1 is defined rather extensively in RFC 2616 [1], so I have no plan to summarize the entire RFC; I am just hitting what I think are the important or misunderstood parts of the header as related to intrusion analysis.

HTTP communication involves a client and a server. A server in this instance does not have to be an actual Web/Domain/Mail server, but is any system that will respond, typically on ports 80 or 8080, to HTTP requests. Likewise, the client is any system that can send HTTP requests and handle HTTP responses.


HEADERS:
There are four basic types of HTTP headers: general-header, request-header, response-header, and entity-header.

Fields:
Connection - This field specifies options for a particular connection that MUST NOT be passed along by proxies. It should NOT list end-to-end fields (Cache-Control is a great example given in the RFC). End-to-end headers are those that are necessary for the client/server communication itself, or that would be useless if they did not make it all the way from one end to the other. Cache-Control is a great example because it tells the client how long to keep a page cached, if at all, along with a few other optional parameters. When the cached copy expires, the client has to go all the way back to the server for a refresh. By the same token, the server sends the Cache-Control header down to the client, so having it parsed and dropped by a proxy along the way would make the field useless. One more note on this field, and one that is covered by the RFC: the "close" option is important because it signals that the connection will be closed once the response is complete.
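
To make the hop-by-hop versus end-to-end split concrete, here is a minimal Python sketch of what a proxy might do before forwarding a message. The function names (strip_hop_by_hop, wants_close) and the X-Debug header are made up for illustration; the set of hop-by-hop names is the one listed in RFC 2616 section 13.5.1. Treat it as a sketch under those assumptions, not as how any particular proxy is written.

```python
# Hop-by-hop header names listed in RFC 2616 section 13.5.1.
HOP_BY_HOP = {
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailers", "transfer-encoding", "upgrade",
}


def connection_tokens(headers):
    """Collect the lowercased tokens from every Connection header."""
    tokens = set()
    for name, value in headers:
        if name.lower() == "connection":
            tokens.update(t.strip().lower() for t in value.split(",") if t.strip())
    return tokens


def strip_hop_by_hop(headers):
    """Return the (name, value) pairs a proxy would forward (hypothetical helper)."""
    # Headers named inside Connection are hop-by-hop for this connection too.
    drop = HOP_BY_HOP | connection_tokens(headers)
    return [(n, v) for n, v in headers if n.lower() not in drop]


def wants_close(headers):
    """True when the sender signals the connection closes after this response."""
    return "close" in connection_tokens(headers)


if __name__ == "__main__":
    sample = [
        ("Connection", "close, X-Debug"),   # X-Debug is an invented header
        ("X-Debug", "1"),
        ("Keep-Alive", "timeout=5"),
        ("Cache-Control", "max-age=60"),
    ]
    print(strip_hop_by_hop(sample))
    print(wants_close(sample))
```

Cache-Control survives the filter because it is end-to-end, while anything named in the Connection field is dropped at the first hop.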

Content-Encoding - This represents what type of coding (usually compression) has been applied to the data being transferred. The most common token I have seen is gzip, which indicates data compressed with GZIP. The RFC registers four values for this field: gzip, compress, deflate, and identity. More can be used, even private schemes unknown to anyone but a bad actor, as this field's tokens are only encouraged ("SHOULD") to be registered with IANA.
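
One way an analyst might act on this is to flag any coding token that is not one of the values registered in RFC 2616. The sketch below assumes the header value has already been pulled out of a capture; unknown_codings is a hypothetical helper name, and x-gzip / x-compress are accepted because the RFC says to treat them as equivalent to gzip / compress.

```python
# Content codings registered by RFC 2616 section 3.5, plus the two legacy
# aliases the RFC says to treat as gzip and compress.
KNOWN_CODINGS = {"gzip", "compress", "deflate", "identity", "x-gzip", "x-compress"}


def unknown_codings(content_encoding):
    """Return the tokens in a Content-Encoding value that are not known codings."""
    tokens = [t.strip().lower() for t in content_encoding.split(",") if t.strip()]
    return [t for t in tokens if t not in KNOWN_CODINGS]


if __name__ == "__main__":
    # "x-secret" is an invented private coding, used only as an example.
    for value in ("gzip", "deflate, gzip", "gzip, x-secret"):
        print(value, "->", unknown_codings(value))
```

An unknown token is not proof of anything bad, but it is a cheap reason to pull that stream aside for a closer look.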

Content-Type - This is one of my favorite header fields. Normally, this field will tell you what type of data to expect in that portion of the traffic. For example, image/jpeg would indicate that a JPEG (picture) file is in the same stream of traffic. This also means that the start of the data should contain one of the JPEG file headers, JFIF for example, and not a file header from a different type, such as MZ for a Windows executable.
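
The comparison between the declared type and the first bytes of the body is easy to sketch in Python. The signature table below is a small, hand-picked assumption (a real tool would carry many more entries), the function names are hypothetical, and the check only makes sense on a body that has already been decoded from any Content-Encoding.

```python
# A few well-known leading byte signatures, keyed by media type.
# This table is deliberately tiny and only illustrative.
SIGNATURES = {
    "image/jpeg": [b"\xff\xd8\xff"],
    "image/png": [b"\x89PNG\r\n\x1a\n"],
    "image/gif": [b"GIF87a", b"GIF89a"],
    "application/pdf": [b"%PDF-"],
}


def body_matches_type(content_type, body):
    """True/False if the body does/does not start with a signature for the
    declared type; None when the type is not in the table."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    signatures = SIGNATURES.get(media_type)
    if signatures is None:
        return None
    return any(body.startswith(sig) for sig in signatures)


def looks_like_windows_exe(body):
    """True when the body starts with the MZ marker of a Windows executable."""
    return body.startswith(b"MZ")


if __name__ == "__main__":
    real_jpeg = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"
    fake_jpeg = b"MZ\x90\x00\x03\x00\x00\x00"
    print(body_matches_type("image/jpeg", real_jpeg))
    print(body_matches_type("image/jpeg", fake_jpeg), looks_like_windows_exe(fake_jpeg))
```

A response that claims image/jpeg but fails the first check and passes the second is exactly the kind of traffic worth escalating.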

Content-Location - Another interesting field. This typically states where the requested resource is located. For example, if a client is requesting a pdf file from someserver.com, this field may contain the absolute or relative URI of the resource; in this case it could be: http://someserver.com/someFile/BadGuy/bad.pdf. The interesting thing here is that I have seen exploits that drop a temporary pdf file on the client machine and then use subsequent traffic to call that file...and the Content-Location will show the location as a folder on the client machine (c:\temp\bad.pdf). Additionally, this can also be used for re-directs, which I have seen a LOT with fake anti-virus issues. A connection between your box and IPA 1 may exist, a script may run or a button may be clicked that initiates a GET request for the malware, and the Content-Location field will name something other than the server that was initially connected to.
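
Both oddities described here, a local file path and a host that differs from the one the request went to, can be flagged with a few lines of Python. This is a sketch only: content_location_flags and its flag strings are invented names, evil.example is a placeholder host, and the local-path pattern covers just Windows drive letters and file: URIs.

```python
import re
from urllib.parse import urlsplit

# Matches a Windows drive path (c:\ or c:/) or a file: URI at the start.
LOCAL_PATH = re.compile(r"^(?:[A-Za-z]:[\\/]|file:)", re.IGNORECASE)


def content_location_flags(request_host, content_location):
    """Return a list of reasons a Content-Location value looks odd.

    request_host is the Host the client asked for (an optional :port is ignored).
    """
    value = content_location.strip()
    if LOCAL_PATH.match(value):
        return ["local-path"]
    flags = []
    host = urlsplit(value).hostname  # None for relative URIs
    expected = request_host.split(":", 1)[0].lower()
    if host and host.lower() != expected:
        flags.append("different-host")
    return flags


if __name__ == "__main__":
    print(content_location_flags("someserver.com",
                                 "http://someserver.com/someFile/BadGuy/bad.pdf"))
    print(content_location_flags("someserver.com", "c:\\temp\\bad.pdf"))
    print(content_location_flags("someserver.com", "http://evil.example/setup.exe"))
```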

Referer - Yes...it's spelled incorrectly. Furthermore, it can be programmatically set to a bad URI. This field is used in a request-header to "document" where the Request-URI came from. Basically, it's "give me this object" (Request-URI) that I found the address of at "this site" (Referer). Clicking a link on the MLB home page that leads to the main Detroit Tigers page would produce a request with a Request-URI of "www.detroittigers.com" and a Referer of "www.mlb.com". This field is used to create track-back links. The security concern here that every analyst should be aware of is that this field is not always accurate and could have been programmatically changed. Because this field can be used to build a list for optimized caching, it can be programmatically changed in order to have the URI "refreshed" from a bad actor.
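
To show how little effort it takes to set this field to whatever you like, the Python sketch below builds a request object with an arbitrary Referer using the standard library's urllib.request. Nothing is sent on the wire; build_request is a hypothetical helper, and the URLs simply reuse the example above.

```python
import urllib.request


def build_request(url, claimed_referer):
    """Build (but do not send) a request whose Referer is whatever we claim."""
    req = urllib.request.Request(url)
    # The client, not the network, decides what goes in this field.
    req.add_header("Referer", claimed_referer)
    return req


if __name__ == "__main__":
    req = build_request("http://www.detroittigers.com/", "http://www.mlb.com/")
    print(req.full_url)
    print(req.get_header("Referer"))
```

Since the value is whatever the sender typed, a Referer in a capture is a hint about where a request came from, never evidence.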

Accept-Language - This indicates the language that the client would like the requested resources to be returned in. This is an Internationalization (I18N) compatibility issue, but it does produce something interesting for an analyst. If I see that a requester's Accept-Language token is set to "en-ca" and I know that Canada always tries to infiltrate my network, I would be more inclined to include this traffic in deeper research, eh. If nothing else, it would allow me to group multiple items for aggregation prior to analysis.
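
The aggregation idea can be sketched in Python by parsing each Accept-Language value into (language, q) pairs and counting the top preference per request. The helper names are made up, a missing q is treated as 1.0 as the RFC specifies, and an unparseable q is treated as 0.0, which is my own assumption.

```python
from collections import Counter


def parse_accept_language(value):
    """Parse an Accept-Language value into (tag, q) pairs, highest q first."""
    ranges = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        params = params.strip()
        q = 1.0  # RFC 2616: a missing q value defaults to 1
        if params.lower().startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # assumption: treat a malformed q as lowest preference
        ranges.append((tag.strip().lower(), q))
    # sorted() is stable, so equal q values keep their original order.
    return sorted(ranges, key=lambda pair: pair[1], reverse=True)


def top_languages(header_values):
    """Count the most-preferred language across many requests."""
    counts = Counter()
    for value in header_values:
        parsed = parse_accept_language(value)
        if parsed:
            counts[parsed[0][0]] += 1
    return counts


if __name__ == "__main__":
    seen = ["en-ca, en;q=0.8", "en-US", "fr;q=0.3, en-ca;q=0.9"]
    print(top_languages(seen).most_common())
```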


References:
[1] http://www.w3.org/Protocols/rfc2616/rfc2616.html