breakup.js

April 14th, 2013

It’s not you, it’s me.

A few months ago I released a small JavaScript micro-framework: breakup.js

Serially enumerating over a collection (such as using async.forEachSeries() in Node.js or jQuery.each() in the browser) can lead to performance and responsiveness issues if processing or looping through the collection takes too long. In some browsers, enumerating over a large number of elements (or doing a lot of work on each element) may cause the browser to become unresponsive, and possibly prompt the user to stop running the script.

breakup.js helps solve this problem by breaking up the enumeration into time-based chunks, yielding to the environment if a threshold of time has passed before continuing.  This helps avoid a Long Running Script dialog in browsers, as they are given a chance to update their UI.  It is meant to be a simple, drop-in replacement for async.forEachSeries().  It also provides breakup.each() as a replacement for jQuery.each() (though the developer may have to modify code flow to deal with the asynchronous nature of breakup.js).

breakup.js does this by keeping track of how much time the enumeration has taken after processing each item.  If the enumeration time has passed a threshold (the default is 50ms, but this can be customized), the enumeration will yield before resuming.  Yielding can be done immediately in environments that support it (such as process.nextTick() in Node.js and setImmediate() in modern browsers), and will fall back to setTimeout(..., 4) in older browsers.  This yield gives the environment a chance to do any UI and other processing work it wants to do.  In browsers, this helps reduce the chance of a Long Running Script dialog.
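As a rough sketch of that fallback chain (an illustration of the idea described above, not breakup.js’s actual source):

// Prefer an immediate yield where the environment supports it,
// otherwise fall back to a short setTimeout.
function yieldToEnvironment(callback) {
    if (typeof process !== "undefined" && typeof process.nextTick === "function") {
        // Node.js
        process.nextTick(callback);
    } else if (typeof setImmediate === "function") {
        // Environments that support setImmediate
        setImmediate(callback);
    } else {
        // Older browsers: a short timeout lets the UI update
        setTimeout(callback, 4);
    }
}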

breakup.js is primarily meant to be used in a browser environment, as Node.js code is already asynchronously driven. You won’t see a Long Running Script dialog in Node.js. However, you’re welcome to use the breakup Node.js module if you want more control over how much time your enumerations take.  For example, if you have thousands of items to enumerate and you want to process them lazily, you could set the threshold to 100ms with a 10000ms wait time and specify the forceYield parameter, so other work is prioritized.
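Since it’s meant to be a drop-in replacement for async.forEachSeries(), basic usage should look something like the sketch below (I’m assuming the replacement is exposed as breakup.forEachSeries() with the same arguments; see the project’s README for the exact parameters that control the threshold, wait time and forceYield behavior):

breakup.forEachSeries(items, function(item, callback) {
    // doExpensiveWorkOn() is a hypothetical stand-in for your per-item processing
    doExpensiveWorkOn(item);
    callback();
}, function(err) {
    // called once after all items have been processed (or an error occurred)
    console.log("enumeration complete", err);
});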

Check it out on GitHub or via npm.

PngOutBatch: Optimize your PNGs by running PngOut multiple times

May 15th, 2012

PngOut is a command-line tool that can losslessly reduce the file size of your PNGs. In many cases, it can reduce the size of a PNG by 10-15%. I’ve even seen some cases where it was able to reduce the file size by over 50%.

There are several other PNG compression utilities out there, such as pngcrush and AdvanceCOMP, but I’ve found PngOut to be the best optimizer most of the time.

There’s an excellent tutorial on PngOut for first-timers.  Running PngOut is pretty easy: simply run it once against your PNG:

PngOut.exe [image.png]

However, to get the best optimization of your images, you can run PngOut multiple times with different block sizes (e.g., /b1024) and randomized initial tables (/r).
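For example, a few extra passes by hand might look like this (each /r run starts from a different randomized initial table, so even repeating the same block size can find additional savings):

PngOut.exe image.png /b1024 /r
PngOut.exe image.png /b1024 /r
PngOut.exe image.png /b256 /r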

There’s a commercial program, PngOutWin, that can run through all of the block sizes using multiple CPU cores, but I wanted something free that I could run from the command line.

To aid in this, I created a simple DOS batch script that runs PngOut through 9 different block sizes (from 0 to 8192), with each block size run multiple times with random initial tables.

While the first iteration of PngOut does all of the heavy lifting, I’ve sometimes found that using the different block sizes can eke out a few extra bytes (sometimes 100 bytes or more beyond the initial pass).  You may not care about optimizing your PNG to the absolute last byte possible, but I try to run any new PNGs headed for production on my websites and mobile apps through this batch script before they’re committed to the wild.

Running PngOutBatch is as easy as running PngOut:

PngOutBatch.cmd [image.png] [number of iterations per block size - defaults to 5]

PngOutBatch will show progress as it reduces the file size.  Here’s a sample compressing the PNG logo from libpng.org:

Blocksize: 0
Iteration #1: Saved 2529 bytes
Iteration #2: No savings
Iteration #3: No savings
Iteration #4: No savings
Iteration #5: No savings
Blocksize: 128
Iteration #1: Saved 606 bytes
Iteration #2: Saved 10 bytes
Iteration #3: No savings
Iteration #4: Saved 2 bytes
Iteration #5: No savings
Blocksize: 192
Iteration #1: No savings
Iteration #2: No savings
Iteration #3: No savings
Iteration #4: No savings
Iteration #5: No savings
Blocksize: 256
Iteration #1: Saved 1 bytes
Iteration #2: No savings
Iteration #3: Saved 5 bytes
Iteration #4: Saved 11 bytes
Iteration #5: No savings
Blocksize: 512
Iteration #1: No savings
Iteration #2: No savings
Iteration #3: No savings
Iteration #4: No savings
Iteration #5: No savings
Blocksize: 1024
Iteration #1: No savings
Iteration #2: No savings
Iteration #3: No savings
Iteration #4: No savings
Iteration #5: No savings
Blocksize: 2048
Iteration #1: No savings
Iteration #2: No savings
Iteration #3: No savings
Iteration #4: No savings
Iteration #5: No savings
Blocksize: 4096
Iteration #1: No savings
Iteration #2: No savings
Iteration #3: No savings
Iteration #4: No savings
Iteration #5: No savings
Blocksize: 8192
Iteration #1: No savings
Iteration #2: No savings
Iteration #3: No savings
Iteration #4: No savings
Iteration #5: No savings
D:\temp\test.png: SUCCESS: 17260 bytes originally, 14096 bytes final: 3164 bytes saved

The first block size (0) reduced the file by 2529 bytes, then the 128-byte block size further reduced it by 606, 10, then 2 bytes. The 192-byte block size didn’t help, but a 256-byte block size reduced the file size by 1, 5, then 11 more bytes.  Larger block sizes didn’t help, but at the end of the day we reduced the PNG by 3164 bytes (18%), which is 635 bytes (25%) more than if we had only run it once.

The PngOutBatch.cmd script is hosted at Gist.GitHub if you want to use it or contribute changes.

DIY Cloud Backup using Amazon EC2 and EBS

February 20th, 2012

I’ve created a small set of scripts that allows you to use Amazon Web Services to backup files to your own personal “cloud”. It’s available at GitHub for you to download or fork.

Features

  • Uses rsync over ssh to securely backup your Windows machines to Amazon’s EC2 (Elastic Compute Cloud) cloud, with persistent storage provided by Amazon EBS (Elastic Block Store)
  • Rsync efficiently mirrors your data to the cloud by only transmitting changed deltas, not entire files
  • An Amazon EC2 instance is used as a temporary server inside Amazon’s data center to backup your files, and it is only running while you are actively performing the rsync
  • An Amazon EBS volume holds your backup and is only attached during the rsync, though you could attach it to any other EC2 instance later for data retrieval, or snapshot it to S3 for point-in-time backup

Introduction

There are several online backup services available, from Mozy to Carbonite to Dropbox. They all provide various levels of backup services for little or no cost. They usually require you to run one of their apps on your machine, which backs up your files periodically to their “cloud” of storage.

While these services may suffice for the majority of people, you may wish to take a little more control of your backup process. For example, you are trusting their client app to do the right thing, and trusting that your files are stored securely in their data centers. They may also put limits on the rate they upload your backups, change their pricing, or even go out of business.

On the other hand, one of the simplest tools to backup files is a program called rsync, which has been around for a long time. It efficiently transfers files over a network, and can be used to only transfer the parts of a file that have changed since the last sync. Rsync can be run on Linux or Windows machines through Cygwin. It can be run over SSH, so backups are performed with encryption. The problem is you need a Linux rsync server somewhere as the remote backup destination.

Instead of relying on one of the commercial backup services, I wanted to create a DIY backup “cloud” that I had complete control of. This script uses Amazon Web Services, which offers on-demand compute instances (EC2) and storage volumes (EBS). It uses the amazingly simple, reliable and efficient rsync protocol to back up your documents quickly to Amazon’s data centers, only using an EC2 instance for the duration of the rsync. Your backups are stored on EBS volumes in Amazon’s data center, and you have complete control over them. With this DIY method of backup, there’s no upload rate-limiting and no client program constantly running on your computer. You can even do things like encrypt the volume you’re backing up to. NOTE: As of 2014-05-21, EBS volumes can be encrypted automatically.

The only service you’re paying for is Amazon EC2 and EBS, which is pretty cheap, and not likely to disappear any time soon. For example, my monthly EC2 costs for performing a weekly backup are less than a dollar, and EBS costs at this time are as cheap as $0.10/GB/mo.

These scripts are provided to give you a simple way to backup your files via rsync to Amazon’s infrastructure, and can be easily adapted to your needs.

How It Works

This script is a simple DOS batch script that can be run to launch an EC2 instance, perform the rsync, stop the instance, and check on the status of your instances.

After you’ve created your personal backup “cloud” (see Amazon Cloud Setup), and have the Required Tools, you simply run amazon-cloud-backup.cmd -start to start up a new EC2 instance. Internally, this uses the Amazon API Developer Tools to start the instance via ec2-run-instances. There’s a custom boot script for the instance, amazon-cloud-backup.bootscript.sh, that works well with the Amazon Linux AMIs to enable root access to the machine over SSH (they initially only offer SSH access to the ec2-user account). We need root access to perform the mount of the volume.

After the instance is started, the script attaches your personal EBS volume to the device. Its remote address is queried via ec2-describe-instances, and SSH is used to mount the EBS volume to a backup point (e.g., /backup). Once this is completed, your remote EC2 instance and EBS volume are ready for you to rsync.

To start the rsync, you simply need to run amazon-cloud-backup.cmd -rsync [options]. Rsync is started over SSH, and your files are backed up to the remote volume.

Once the backup is complete, you can stop the EC2 instance at any time by running amazon-cloud-backup.cmd -stop, or get the status of the instance by running amazon-cloud-backup.cmd -status. You can also check on the free space on the volume by running amazon-cloud-backup.cmd -volumestatus.
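Under the hood, the flow is roughly equivalent to running something like the following by hand (a simplified sketch using placeholder AMI, volume and instance IDs and example paths; the script wraps these calls via the Amazon API Developer Tools and the Cygwin binaries described below):

ec2-run-instances ami-xxxxxxxx -t t1.micro -k amazon-cloud-backup-rsync -g ssh
ec2-describe-instances
ec2-attach-volume vol-xxxxxxxx -i i-xxxxxxxx -d /dev/sdb
ssh -i my-rsync-key.pem root@ec2-1-2-3-4.us-west-2.compute.amazonaws.com "mount /dev/sdb1 /backup"
rsync -az -e "ssh -i my-rsync-key.pem" /cygdrive/c/Users/me/Documents/ root@ec2-1-2-3-4.us-west-2.compute.amazonaws.com:/backup/
ec2-stop-instances i-xxxxxxxx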

There are a couple of things you will need to configure to set this all up. First, you need to sign up for Amazon Web Services and generate the appropriate keys and certificates. Then you need a few helper programs on your machine, for example rsync.exe and ssh.exe. Finally, you need to adjust a few settings in amazon-cloud-backup.cmd so the backup is tailored to your keys and requirements.

Amazon “Cloud” Setup

To use this script, you need to have an Amazon Web Services account. You can sign up for one at https://aws.amazon.com/. Once you have an Amazon Web Services account, you will also need to sign up for Amazon EC2.

Once you have access to EC2, you will need to do the following.

  1. Create an X.509 Certificate so the scripts can access the Amazon Web Services API. You can get this in your Security Credentials page. Click on the X.509 Certificates tab, then Create a new Certificate. Download both the X.509 Private Key and Certificate files (pk-xyz.pem and cert-xyz.pem).
  2. Determine which Amazon Region you want to work out of. See their Reference page for details. For example, I’m in the Pacific Northwest so I chose us-west-2 (Oregon) as the Region.
  3. Create an EC2 Key Pair so you can log into your EC2 instance via SSH. You can do this in the AWS Management Console. Click on Create a Key Pair, name it (for example, “amazon-cloud-backup-rsync”) and download the .pem file.
  4. Create an EBS Volume in the AWS Management Console. Click on Volumes and then Create Volume. You can create whatever size volume you want, though you should note that you will pay monthly charges for the volume size, not the size of your backed up files.
  5. Determine which EC2 AMI (Amazon Machine Image) you want to use. I’m using the Amazon Linux AMI: EBS Backed 32-bit image. This is a Linux image provided and maintained by Amazon. You’ll need to pick the appropriate AMI ID for your region. If you do not use one of the Amazon-provided AMIs, you may need to modify amazon-cloud-backup.bootscript.sh for the backup to work.
  6. Create a new EC2 Security Group that allows SSH access. In the AWS Management Console, under EC2, open the Security Groups pane. Select Create Security Group and name it “ssh” or something similar. Once added, edit its Inbound rules to allow port 22 from all sources “0.0.0.0/0”. If you know what your remote IP address is ahead of time, you could limit the source to that IP.
  7. Launch an EC2 instance with the “ssh” Security Group. After you launch the instance, you can use the Attach Volume button in the Volumes pane to attach your new volume as /dev/sdb.
  8. Log in to your EC2 instance using ssh (see Required Tools below), then fdisk the volume and create a filesystem. For example:
    ssh -i my-rsync-key.pem ec2-user@ec2-1-2-3-4.us-west-1.compute.amazonaws.com
    [ec2-user] sudo fdisk /dev/sdb
    ...
    [ec2-user] sudo mkfs.ext4 /dev/sdb1
  9. Your Amazon personal “Cloud” is now set up.

Many of the choices you’ve made in this section will need to be set as configuration options in the amazon-cloud-backup.cmd script.

Required Tools

You will need a couple of tools on your Windows machine to perform the rsync backup and query the Amazon Web Services API.

  1. First, you’ll need a few binaries (rsync.exe, ssh.exe) on your system to facilitate the ssh/rsync transfer. Cygwin can be used to accomplish this. You can easily install Cygwin from http://www.cygwin.com/. After installing, pluck a couple of files from the bin/ folder and put them into this directory. The binaries you need are:
    rsync.exe
    ssh.exe
    sleep.exe

    You may also need a couple libraries to ensure those binaries run:

    cygcrypto-0.9.8.dll
    cyggcc_s-1.dll
    cygiconv-2.dll
    cygintl-8.dll
    cygpopt-0.dll
    cygspp-0.dll
    cygwin1.dll
    cygz.dll
  2. You will need the Amazon API Developer Tools, downloaded from http://aws.amazon.com/developertools/. Place them in a sub-directory called amazon-tools\

Script Configuration

Now you simply have to configure amazon-cloud-backup.cmd.

Most of the settings can be left at their defaults, but you will likely need to change the locations and name of your X.509 Certificate and EC2 Key Pair.
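For illustration, the kinds of settings involved look something like this (these variable names are hypothetical; check the comments in amazon-cloud-backup.cmd for the actual setting names):

rem Hypothetical example settings -- the script's actual variable names may differ
set EC2_CERT=C:\aws\cert-xyz.pem
set EC2_PRIVATE_KEY=C:\aws\pk-xyz.pem
set EC2_KEYPAIR_NAME=amazon-cloud-backup-rsync
set EC2_KEYPAIR_FILE=C:\aws\amazon-cloud-backup-rsync.pem
set EBS_VOLUME_ID=vol-xxxxxxxx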

Usage

Once you’ve done the steps in Amazon “Cloud” Setup, Required Tools and Script Configuration, you just need to run the amazon-cloud-backup.cmd script.

These simple steps will launch your EC2 instance, perform the rsync, and then stop the instance.

amazon-cloud-backup.cmd -launch
amazon-cloud-backup.cmd -rsync
amazon-cloud-backup.cmd -stop

After -stop, your EC2 instance will stop and the EBS volume will be detached.

Source

The source code is available at GitHub. Feel free to send pull requests for improvements!

Auto-ban website spammers via the Apache access_log

January 24th, 2012

During the past few months, several of my websites have been the target of some sort of SPAM attack.  After getting alerted (by Cacti) that my servers were under high load, I found that a small number of IP addresses were loading and re-loading or POSTing to the same pages over and over again.  In one of the attacks, they were simply reloading a page several times a second from multiple IP addresses.  In another attack, they were POSTing several megabytes of data to a form (which spent time validating the input), several times a second. I’m not sure of their motives – my guess is that they’re either trying to game search rankings (the POSTings) or someone is running an improperly configured robot.

Since I didn’t have anything in place to automatically drop requests from these rogue SPAMmers, the servers were coming under increasing load, causing real visitors’ page loads to slow down.

After looking at the server’s Apache access_log, I was able to narrow down the IPs causing the issue.  With those IPs in hand, I simply created a few iptables rules to drop all packets from their addresses. Within a few seconds, the load on the server returned to normal.

I didn’t want to play catch-up the next time this happened, so I created a small script to automatically parse my server’s access_logs and auto-ban any IP address that appears to be doing inappropriate things.

The script is pretty simple.  It uses tail to look at the last $LINESTOSEARCH lines of the access_log, grabs all of the IPs via awk, sorts and counts them via uniq, then looks to see if any of these IPs had loaded more than $THRESHOLD pages.  If so, it does a quick query of iptables to see if the IP is already banned.  If not, it adds a single INPUT rule to DROP packets from that IP.

Here’s the code:

#!/bin/bash

#
# Config
#

# if more than the threshold, the IP will be banned
THRESHOLD=100

# search this many recent lines of the access log
LINESTOSEARCH=50000

# term to search for
SEARCHTERM=POST

# logfile to search
LOGFILE=/var/log/httpd/access_log

# email to alert upon banning
ALERTEMAIL=foo@foo.com

#
# Get the last n lines of the access_log, and search for the term.  Sort and count by IP, outputting the IP if it's
# larger than the threshold.
#
for ip in $(tail -n "$LINESTOSEARCH" "$LOGFILE" | grep "$SEARCHTERM" | awk '{print $1}' | sort | uniq -c | sort -rn | head -20 | awk -v threshold="$THRESHOLD" '{if ($1 > threshold) print $2}')
do
    # Look in iptables to see if this IP is already banned
    if ! iptables -L INPUT -n | grep -q "$ip"
    then
        # Ban the IP
        iptables -A INPUT -s "$ip" -j DROP

        # Notify the alert email
        iptables -L -n | mail -s "Apache access_log banned '$SEARCHTERM': $ip" "$ALERTEMAIL"
    fi
done

You can put this in your crontab so it runs every X minutes. The script will need root access to use iptables.

I have the script in /etc/cron.10minutes and an entry in /etc/crontab that runs all files in that directory every 10 minutes:

0,10,20,30,40,50 * * * * root run-parts /etc/cron.10minutes

Warning: Ensure that the $SEARCHTERM you use will not match a wide set of pages that a web crawler (for example, Google) would see. In my case, I set SEARCHTERM=POST, because I know that Google will not be POSTing to my website, as all of the forms are excluded from crawling via robots.txt.

The full code is also available at Gist.GitHub if you want to fork or modify it. It’s a rather simplistic, brute-force approach to banning rogue IPs, but it has worked for my needs. You could easily update the script to be a bit smarter. If you do, let me know!

Amazon S3/CloudFront 304s stripping Cache-Control headers

October 7th, 2011

TL;DR: Beware of relying on Cache-Control: max-age and Expires HTTP header fallback behavior on Amazon CloudFront. The Cache-Control header may get stripped on CloudFront 304s, and browsers will then have to fall back to whatever is in the Expires header. If that Expires date has passed, or if you never specified it, all subsequent requests for the resource will be conditionally validated by the browser.

Update 2011-12-18: The Amazon CloudFront team has fixed the issue!

The Problem

I was looking at my web server’s health metrics recently (via Cacti), and noticed a spike in outbound traffic and HTTP requests. Analytics logs didn’t show a large increase in visitors or page loads, and it looked like the additional traffic was simply an increase in requests for static images.

The Investigation

The static images in question have HTTP cache headers set for 1 year into the future, so they can be easily cached by browsers and proxies per performance best-practices. The suggested way to set a far expiry date is to set both the Cache-Control header (e.g., Cache-Control: public, max-age=31536000) and an Expires header with a static date set for the same time in the future. The Expires header is an HTTP/1.0 header that sets a specific date, say Jan 1 2011, whereas Cache-Control is relative, in seconds. Theoretically, if both Cache-Control and Expires headers are sent, the Cache-Control header should take precedence, so it’s safe to additionally set Expires for fall-back cases.

This combination of caching headers works well if you are using Amazon’s CloudFront CDN, backed by static files on Amazon S3, which is what I use for several sites. The files are uploaded once to S3, and their HTTP headers are set at upload time.  For the static images, I upload them with a 1-year max-age expiry and an Expires header 1 year from when they’re uploaded. For example, I uploaded an image to S3 on Oct 5 2010 with these headers:

Cache-Control: public, max-age=31536000
Expires: Thu, 05 Oct 2011 22:45:05 GMT

Theoretically, HTTP/1.1 clients (current web browsers) and even ancient HTTP/1.0 proxies should both be able to understand these headers. Even though the Expires header was for Oct 5 2011 (a couple days ago), Cache-Control should take precedence and the content should still be fresh for all current web browsers that recently downloaded the file. HTTP/1.0 proxies will only understand the Expires header, and they may want to conditionally validate the content if the date is past Oct 5 2011, but they should be a small part of HTTP accesses.

So my first thought was that the additional load on the server was from HTTP/1.0 proxies re-validating the already-expired content since I had set the content to expire in 1 year and that date had just passed. I should have set a much-further expiry in the first place — these images never change. To fix this, I could easily just re-upload the content with a much longer Expires (30 years from now should be sufficient).

However, as I was investigating the issue, I noticed via the F12 Developer Tools that IE9 was conditionally validating some of the already-expired images, even though the Cache-Control header should be taking precedence. Multiple images were being conditionally re-validated (incurring an HTTP request and 304 response) for every IE session. All of these images had an Expires header date that had recently passed.

After I cleared my IE browser cache, the problem no longer repro’d. It was only after I happened to F5 the page (refresh) that the past-Expires images were being conditionally requested again on subsequent navigations.

The Repro

Take, for example, this request of a static file on my webserver that expired back on Jan 1, 2010:

GET /test/test-public.txt HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Accept-Language: en-US
User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Host: cf.nicj.net
HTTP/1.0 200 OK
Date: Sat, 08 Oct 2011 02:28:03 GMT
Cache-Control: public, max-age=946707779
Expires: Fri, 01 Jan 2010 00:00:00 GMT 
Last-Modified: Sat, 08 Oct 2011 02:25:58 GMT
ETag: "098f6bcd4621d373cade4e832627b4f6"
Accept-Ranges: bytes
Content-Type: text/plain
Content-Length: 4
Server: AmazonS3

IE and other modern browsers will download this content today, and treat it as fresh for 30 years (946,707,779 seconds), due to the Cache-Control header taking precedence over the Jan 1, 2010 Expires header.

The problem comes when, for whatever reason, a browser conditionally re-validates the content (via If-Modified-Since). Here are IE’s request headers and Amazon’s CloudFront response headers:

GET /test/test-public.txt HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Accept-Language: en-US
User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Host: cf.nicj.net
If-Modified-Since: Sat, 08 Oct 2011 02:28:03 GMT
HTTP/1.0 304 Not Modified
Date: Sat, 08 Oct 2011 02:31:54 GMT
Content-Type: text/plain
Expires: Fri, 01 Jan 2010 00:00:00 GMT
Last-Modified: Sat, 08 Oct 2011 02:25:58 GMT
ETag: "098f6bcd4621d373cade4e832627b4f6"
Age: 232

We see the additional If-Modified-Since in the request, and the same Expires date in the response. Unfortunately, there’s an important missing header in this response: the Cache-Control header. It appears, at least from my testing, that CloudFront strips the Cache-Control header from 304 responses.

After this happens, it appears that IE forgets the original Cache-Control header, so all subsequent navigations to the page will trigger conditional GETs for those resources.  Since the 304 is missing the Cache-Control header, IE just sees the Expires header, and thinks it needs to always re-validate the content from now on.
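For reference, both exchanges above can be captured with curl against the same test URL (at the time of this post, the second, conditional request returned the 304 without the Cache-Control header):

curl -s -o /dev/null -D - http://cf.nicj.net/test/test-public.txt
curl -s -o /dev/null -D - -H "If-Modified-Since: Sat, 08 Oct 2011 02:28:03 GMT" http://cf.nicj.net/test/test-public.txt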

Why This Would Happen

But what’s causing the re-validation (If-Modified-Since) and subsequent 304 in the first place?

User agents shouldn’t normally re-validate these resources, since the original Cache-Control header should keep them fresh for quite a while.  The exceptions are when you force a refresh (F5) of the page, or when the content has passed its natural freshness lifetime.

On F5 refresh, all resources on the page are conditionally re-validated via If-Modified-Since. And, as we’ve seen, the resources on CloudFront are sent back missing the original Cache-Control header, and IE updates its cache with just the Expires tag, instead of keeping the resource still fresh for a year. For some reason, this doesn’t occur with Chrome or Firefox on F5.

In addition, the problem will appear in all browsers when they need to send an If-Modified-Since header for re-validation of content they think might have expired, such as with max-age headers that have expired (shorter-expiring content).

Take, for example, a resource that you set to expire 1 day from now, and either set the Expires header to 1 day from now (per best practices) or simply don’t specify the Expires header:

Cache-Control: public, max-age=86400

For the first 24 hours after your visitor loads the resource, modern browsers won’t re-validate the resource. At hour 24 and 1 second, the browser will send a conditional request. Unfortunately, with CloudFront, the 304 response will be missing the Cache-Control header.  The browser then doesn’t realize that the resource should be fresh for another 24 hours. So even if the content wasn’t actually updated after those 24 hours, all subsequent navigations with the resource will trigger a conditional validation of that resource, since the original Cache-Control headers were lost with the 304. Ugh.

How to Avoid the Issue

Note this doesn’t appear to affect Chrome 14 and Firefox 6 in the F5 scenario. Both browsers send conditional If-Modified-Since headers on F5 and get back the same CloudFront response (sans Cache-Control headers), but they don’t appear to be affected by the missing Cache-Control header. Subsequent navigations in Chrome and Firefox after an F5 do not conditionally re-validate the CloudFront content. They do appear to be affected by the missing Cache-Control header for naturally stale content on If-Modified-Since requests.

I haven’t investigated the F5 problem on pre-IE9 versions, but I would assume the problem exists there as well. As far as I can tell, this isn’t fixed in IE10 beta.

I’ve only found this problem on CloudFront’s CDN servers. I couldn’t find a way to get Apache to naturally skip the Cache-Control header for 304s if the header was in the original HTTP 200 response (for example, when using mod_expires on static content).

The bottom line is that requests that send an If-Modified-Since to CloudFront and get a 304 back will essentially lose the Cache-Control hints. If your Expires header is missing, or in the past, the resource will be conditionally validated on every page navigation until it gets evicted from the cache. That can cause a lot of unnecessary requests and will slow down your visitors’ page loads.

The simple solution is to use a much further-out expiry time; 30 years should suffice. Then, if the original Cache-Control header is lost from CloudFront 304s, the 30-years-from-now Expires header will keep the resource from having to be re-validated.
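For example, if you upload to S3 with a tool like s3cmd (just one way to set the headers at upload time; the bucket name and the far-future date here are placeholders), that means something like:

s3cmd put --acl-public --add-header="Cache-Control: public, max-age=946707779" --add-header="Expires: Thu, 31 Dec 2037 23:59:59 GMT" image.png s3://my-bucket/image.png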

I’m not sure why Amazon CloudFront strips the Cache-Control header from 304 responses. I’ll follow up with them.

Back to my original problem: I think it’s actually Amazon’s CloudFront servers noting that the Expires dates for a lot of my static images are past due.  They’re checking the origin server to see if any new content is available.  The above issue isn’t likely causing a ton of additional load, but it was interesting to find nonetheless!

Update 2011-10-12: I’ve opened a thread on the Amazon CloudFront forums here.  The team has responded saying they’re looking into the issue.

Update 2011-12-18: The Amazon CloudFront team has fixed the issue!