Archive for the ‘Tech’ Category

MySQL: TEXT vs. VARCHAR Performance

January 20th, 2011

Starting with MySQL 5.0.3, the maximum field length for VARCHAR fields was increased from 255 characters to 65,535 characters.  This is good news, as VARCHAR fields, as opposed to TEXT fields, are stored in-row for the MyISAM storage engine (InnoDB has different characteristics).  TEXT and BLOB fields are not stored in-row; they require a separate lookup (and a potential disk read) if their column is included in a SELECT clause.  Additionally, the inclusion of a TEXT or BLOB column in any sort will force the sort to use a disk-based temporary table, as the MEMORY (HEAP) storage engine, which is used for in-memory temporary tables, does not support TEXT or BLOB columns.  Thus, the benefits of using a VARCHAR field instead of TEXT for columns between 255 and 65k characters seem obvious at first glance in some scenarios: potentially fewer disk reads (for queries including the column, as there is no out-of-row data) and fewer writes (for queries with sorts including the column).

Following a review of my MySQL server instance’s performance metrics (mysqltuner.sh and mysql tuning-primer.sh point out interesting things), I found that approximately 33% of the server’s temporary tables were on-disk temporary tables (as opposed to in-memory tables).  On-disk temporary tables can be created for several reasons, most notably if the resulting table will be larger than the minimum of MySQL’s tmp_table_size/max_heap_table_size variables OR when a TEXT/BLOB field is included in the sort.
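
A percentage like this can be estimated from two MySQL status counters, Created_tmp_disk_tables and Created_tmp_tables (both visible via SHOW GLOBAL STATUS).  The Python sketch below shows only the arithmetic; the helper name and the counter values are made up for illustration, and the tuning scripts may compute their figure somewhat differently:

```python
# Hypothetical helper: estimate the share of internal temporary tables
# that ended up on disk, given two MySQL status counters.
def on_disk_tmp_table_pct(created_tmp_disk_tables: int, created_tmp_tables: int) -> float:
    """Return the percentage of temporary tables that were created on disk."""
    if created_tmp_tables == 0:
        return 0.0
    return 100.0 * created_tmp_disk_tables / created_tmp_tables


# Made-up counter values, chosen only to land on a figure of about 33%.
print(round(on_disk_tmp_table_pct(3300, 10000)))  # 33
```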

Since any operation that uses the disk for sorting will be noticeably slower than using RAM, I started investigating why so many of my temporary tables went to disk.  One of my website’s most heavily used MyISAM tables contained two TEXT columns, which were textual descriptions of an object, submitted by the website’s visitors.  The table schema was created prior to MySQL v5.0.3, when VARCHAR fields were limited to 255 characters or less.  For this object, I wanted to allow descriptions larger than 255 characters, so the table utilized the two TEXT fields to store this data.

So my first thought in reducing the number of on-disk temporary tables was to convert these TEXT fields to VARCHAR.  I sampled the columns’ values, and found that the two fields’ maximum input sizes were currently around 8KB.  Not thinking too much about what size I wanted to support, I decided that I could set both fields to VARCHAR(30000) instead of TEXT.  I changed the fields, verified through automated tests that everything still worked, and called it a night.

Over the next two days, I noticed that there were several alarming metrics trending the wrong way.  I utilize Cacti to monitor several hundred server health metrics — it is great for showing trends and alerting about changes.  Unfortunately, it was reporting that server load, page load time and disk activity were all up significantly — especially disk writes.  Wondering if it was a fluke, I left the server alone for another day to see if it would subside, but the high load and disk writes continued consistently.  The load was causing real user-perceived performance impacts for the website’s visitors, causing average page load time to increase from 70ms to 470ms.

Here’s what the Cacti graphs looked like:

Wouldn’t you be alarmed?

Not wanting to run an intensive performance review or diagnostics on the live website, I came up with a plan for how I would diagnose what the problem was:

  1. Enable temporary lightweight tracing on the server to try to determine the source of the increased disk activity (iotop or dstat).
  2. If MySQL is causing most of the disk activity, temporarily revert the VARCHAR(30000) columns back to TEXT, as I suspected they were somehow the cause of the slowdown.
  3. Perform a deeper analysis of the problem on a separate machine.

Running iotop on the server confirmed that a majority of the disk writes were coming from MySQL.  Additionally, after I reverted the columns to TEXT, the server load and page load times went back to normal.

So why did my seemingly obvious “performance win” end up as a performance dud?  Shame on me for not testing and verifying the changes before I pushed them live!

I didn’t want to do any more diagnosis of the problem on the live Linux server — there’s no point in punishing my visitors.  I have a separate Linux development server that I could have used to dig a little deeper, but I’m more comfortable doing performance analysis on Windows, and luckily, MySQL has Windows ports.

For almost all of my performance analysis work, I use the excellent Windows Performance Tools (WPT) Kit (using ETW and xperf).  If you haven’t used the WPT and xperf tools before, there are some good guides on trace capturing using ETW and visual analysis via xperfview on MSDN.  ETW is a lightweight run-time tracing environment for Windows.  ETW can trace both kernel (CPU, disk, network and other activity) and user mode (application) events, and saves them to a .ETL file that can later be processed.  The Windows Performance team and many other teams in Windows regularly use ETW/xperf for a majority of their performance analysis.

To figure out what’s going on with our VARCHAR columns, I first needed to ensure that I could replicate the problem on my Windows MySQL machine.  I installed MySQL 5 and loaded a snapshot of my live database into the instance.

I then looked at the MySQL Slow Query Log from the real server to see what queries were taking a long time (>1 second).  There were thousands of instances of a query that looked something like this:

SELECT   t2.*
FROM     table1 t1, table2 t2
WHERE    t2.t1id = t1.id
ORDER BY t1.id
LIMIT    0, 10

Which looks innocent enough, but table2 is the table in which I changed the two TEXT fields to VARCHAR, and I’m querying all of its columns (SELECT *).  Before, because of the TEXT columns, this query would’ve used an on-disk temporary table for the results (the MySQL manual tells us this is the case for results that need temporary tables and have TEXT columns).  So why does this query appear to be so much slower now?

First of all, I checked how this query responded on my Windows MySQL instance:

mysql> SELECT t2.*
FROM table1 t1, table2 t2
WHERE t2.t1id = t1.id
ORDER BY t1.id
LIMIT 0, 10;
...
10 rows in set (1.71 sec)

This confirmed the issue appeared on my development machine as well!  This query should be nearly instantaneous (less than 50ms), and 1,710 milliseconds is a long time to wait for a single query of many in a page load.

My guess at this point was it had something to do with temporary tables.  And disks.  Since that was what I was trying to improve with my TEXT to VARCHAR change, it only makes sense that I somehow made it worse.  So to validate this theory, I enabled a bit of lightweight ETW tracing to see how the OS handled the query.

I started ETW tracing to get basic CPU and disk IO information:

xperf -on base+fileio

Then I re-ran the above query in the MySQL command line, and saved the trace to disk (trace.etl) after the query finished:

xperf -d trace.etl

Loading the trace up in xperfview showed some interesting things:

xperfview trace.etl

From a CPU usage perspective, the CPU (on my quad-core system) seemed to be pretty active, but not 100% utilized.  Looking at the CPU Sampling Summary Table, I found that mysqld-nt.exe was using 1,830ms / 17% of my CPU (essentially 68% of one of the cores).  Not bad, but not maxed out either.  But what’s interesting here was the disk utilization graph.  For a period of ~700ms, we’re 100% utilized.  Highlighting that region and viewing the summary table showed where we spent our time:

#sql_1a00_0.MYD is a temporary table from MySQL (which can be confirmed from the File IO graph).  In this case, our single query caused 38MB of disk writes and ~626ms to write/read it back in.

Huh?

At this point, I wanted to double-check that the TEXT to VARCHAR change caused this.  I changed the columns back to TEXT, and re-ran the same query:

mysql> SELECT t2.*
FROM table1 t1, table2 t2
WHERE t2.t1id = t1.id
ORDER BY t1.id
LIMIT 0, 10;
...
10 rows in set (0.03 sec)

Well, 0.03 seconds is a lot faster than 1.71 seconds.  This is promising.  I took another ETW trace of the query with the TEXT fields:

After switching back to TEXT fields, mysql used ~30ms of CPU and caused no disk activity.

Now that I knew what was causing the slowdown, I wanted to try to fully understand why this was the case.  Remember, I started down this path originally because I found that a high proportion of my temporary tables were on-disk temporary tables.  In the interest of seeing less disk activity on my server, I attempted to change several TEXT columns (which can cause on-disk temporary tables) to VARCHAR(30000) columns.  However, I didn’t fully look into what was causing the on-disk temporary tables, and instead just guessed.  As a result, my server’s perf tanked!

Now’s a good time to review the first paragraph of this post.  There are several reasons MySQL may use an internal temporary table.  One interesting quote:

Such a [temporary] table can be held in memory and processed by the MEMORY storage engine, or stored on disk and processed by the MyISAM storage engine.

and

If an internal temporary table is created initially as an in-memory table but becomes too large, MySQL automatically converts it to an on-disk table. The maximum size for in-memory temporary tables is the minimum of the tmp_table_size and max_heap_table_size values. This differs from MEMORY tables explicitly created with CREATE TABLE: for such tables, the max_heap_table_size system variable determines how large the table is permitted to grow and there is no conversion to on-disk format.

So a temporary table can start out as a MEMORY table, then if MySQL realizes it’s too big for tmp_table_size and max_heap_table_size, it may convert it to a MyISAM table on the disk.  One caveat with the MEMORY engine is:

MEMORY tables use a fixed-length row-storage format. Variable-length types such as VARCHAR are stored using a fixed length.

How did this behavior affect me, and cause the 30ms query to take 1,710 ms?  Well, let’s start with the two new VARCHAR(30000) columns.  In a normal MyISAM table, with a dynamically sized row, these two columns only take as much space as the data they contain, plus a length prefix (1 byte for columns of up to 255 bytes, 2 bytes for larger columns such as these).  That is, if I had a row and these two columns only had 10 bytes of data in them, the row size would be (10+2)*2+[other columns].  However, if I happened to convert this MyISAM table to use fixed-length rows, or I was using the MEMORY storage engine, the row size would be 30000*2+[other columns].  Currently, according to my dataset, these dynamically sized rows only required an average of 1,648 bytes per row.
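
To make the difference concrete, here is a minimal Python sketch of the two row-size calculations for the description columns alone.  It assumes single-byte characters and a 2-byte length prefix per value (what MySQL uses for VARCHAR columns that can hold more than 255 bytes); the function names are hypothetical and the other columns are ignored:

```python
# Sketch of per-row storage for the two description columns only.
# Assumptions: 1 byte per character, and a 2-byte length prefix per
# VARCHAR value (the column can hold more than 255 bytes).
VARCHAR_DECLARED_LEN = 30000  # declared size of each column, from the post
LENGTH_PREFIX_BYTES = 2       # assumption, see above


def dynamic_row_bytes(value_lengths):
    """Bytes used with dynamic rows: actual data plus a length prefix each."""
    return sum(length + LENGTH_PREFIX_BYTES for length in value_lengths)


def fixed_row_bytes(column_count, declared_len=VARCHAR_DECLARED_LEN):
    """Bytes reserved when every VARCHAR is stored at its declared size."""
    return column_count * declared_len


print(dynamic_row_bytes([10, 10]))  # 24
print(fixed_row_bytes(2))           # 60000
```

Two 10-byte descriptions cost a couple of dozen bytes in a dynamic row, but 60,000 bytes once each column is stored at its declared length.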

And that’s the crux of the problem.  My query above, simple enough, requires a temporary table to do its work.  We can verify this via the MySQL EXPLAIN command:

mysql> EXPLAIN SELECT t2.*
FROM table1 t1, table2 t2
WHERE t2.t1id = t1.id
ORDER BY t1.id
LIMIT 0, 25;
+----+-------------+-------+---------+------+---------------------------------+
| id | select_type | table | key     | rows | Extra                           |
+----+-------------+-------+---------+------+---------------------------------+
|  1 | SIMPLE      | t2    | NULL    | 8813 | Using temporary; Using filesort |
|  1 | SIMPLE      | t1    | PRIMARY | 1    | Using index                     |
+----+-------------+-------+---------+------+---------------------------------+

(I trimmed a couple of columns to fit the page’s width).

Here, we see our t2 table Using temporary.  MySQL converted the 8,813 dynamic-length rows to fixed-length rows, which expanded the VARCHARs to their full size: approximately 60,600 bytes per row.  That’s 8,813 rows * 60,600 bytes = 534,067,800 bytes to deal with!  That is far more than the server’s tmp_table_size and max_heap_table_size variables allow for an in-memory temporary table, so MySQL ended up moving a lot of this work to disk.  As a result, we had ~700ms of disk writes with this query when using VARCHAR(30000) columns.
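
The arithmetic above, and the comparison against the in-memory limit, can be sketched in a few lines of Python.  The function names are hypothetical, and the limit values are the approximate defaults quoted later in this post for one Windows MySQL 5.1 instance, so treat the result as an illustration rather than a measurement:

```python
MB = 1024 * 1024

ROWS = 8813               # rows examined for t2, from the EXPLAIN output
FIXED_ROW_BYTES = 60600   # approximate fixed-length row size from the post


def in_memory_limit(tmp_table_size: int, max_heap_table_size: int) -> int:
    """In-memory temporary tables are capped by the smaller of the two."""
    return min(tmp_table_size, max_heap_table_size)


def fits_in_memory(rows: int, row_bytes: int, limit_bytes: int) -> bool:
    """True if a table of this estimated size stays under the limit."""
    return rows * row_bytes <= limit_bytes


# Approximate defaults mentioned in the post (35MB and 16MB).
limit = in_memory_limit(35 * MB, 16 * MB)

print(ROWS * FIXED_ROW_BYTES)                        # 534067800
print(limit)                                         # 16777216
print(fits_in_memory(ROWS, FIXED_ROW_BYTES, limit))  # False
```

At roughly 534MB against a 16MB cap, the estimated table is more than thirty times too large to stay in memory.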

There are a few ways to avoid this behavior in MyISAM tables:

  1. Use TEXT fields, with their known caveats.
  2. Use a smaller, more reasonable VARCHAR size.  These fields probably don’t need to hold more than 10k of data.  One could reduce their size to 10k or smaller, or even move them to another table and only retrieve them when necessary.
  3. Fiddle with the tmp_table_size and max_heap_table_size variables.  These two variables dictate which queries use on-disk temporary tables, as described here.  They are set at approximately 35MB/16MB by default (on my Windows MySQL 5.1 instance).

I made a couple of changes.  I changed one of the original 30k fields to 10k, and changed the other one to 1k.  This reduced the potential row size in MEMORY temporary tables tremendously.  I also upped the tmp_table_size and max_heap_table_size variables to 128MB on my server.  The combination of these two changes ensured that the specific query above was no longer causing all of the performance issues (for now).  I should probably move the 10k field to another table (or back to TEXT) to be sure.
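
As a rough sanity check of those two changes, the same worst-case estimate can be redone in Python.  This is a sketch under stated assumptions: "10k" and "1k" are taken as 10,000 and 1,000 single-byte characters, and the other columns are assumed to take about 600 bytes (inferred from 60,600 minus 2 * 30,000), none of which was measured directly:

```python
MB = 1024 * 1024

ROWS = 8813                # rows from the EXPLAIN output
OTHER_COLUMNS_BYTES = 600  # assumption: 60,600 - 2 * 30,000

old_row_bytes = 2 * 30000 + OTHER_COLUMNS_BYTES    # two VARCHAR(30000) columns
new_row_bytes = 10000 + 1000 + OTHER_COLUMNS_BYTES # one 10k and one 1k column
new_limit = min(128 * MB, 128 * MB)                # both variables set to 128MB

print(old_row_bytes, new_row_bytes)       # 60600 11600
print(ROWS * new_row_bytes)               # 102230800
print(ROWS * new_row_bytes <= new_limit)  # True
```

Under these assumptions the estimate now fits within the raised limit, but with limited headroom, which is consistent with the "for now" caveat.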

I probably didn’t need to use ETW and xperf here to look into things.  Since I was aware that the changes I made to the database had a high correlation with the slow-downs I was seeing, and reverting these changes fixed the issue, I could have probably figured out what was going on by reading the MySQL manual a bit more.  But I like to dig into things, and I think xperf can help visually communicate what’s going on with your system for problems like this.  Seeing how much blocking IO a single query can cause really sheds light on things!

Admittedly, the server MySQL is running on also hosts a web server and multiple sites.  A dedicated SQL server with fast, dedicated disks would help minimize problems like this.

One interesting note is that MySQL version 5.1 on Windows doesn’t have the same IO patterns as version 5.0 did – I see the same File IO for temporary tables with the VARCHAR fields, but not the same amount of disk activity.  This could mean that 5.1 memory maps or caches a lot of the on-disk temporary file and doesn’t end up actually writing it out to disk.  I am unsure why.

And again, InnoDB has different performance characteristics.  I did not do any testing on InnoDB.

So at the end of the day, my server is back to normal, and I’ve learned a little bit more about how MySQL works.  I hope you did too.

The Economist, and, The Kindle: Take 2

January 3rd, 2010

A while ago I wrote about how you can get The Economist on your Kindle (and other e-readers) by running a simple PHP script that crawls economist.com and generates a .mobi file that it emails to your Kindle weekly. Unfortunately (though understandably), around July 2009 they restricted their This Week’s Print Edition website to subscribers of their online and print editions.

With a little bit of work, I’ve updated the economist-to-kindle.php PHP script to handle logging into economist.com with your user-name and password so it can generate a Kindle version again:

https://github.com/nicjansma/economist-to-kindle

With this update, and if you’re a print edition subscriber, you should be able to get this week’s edition on your Kindle again.

Updated 2010/01/25: Several bugfixes, see comments for details.

Updated 2010/07/22: slifox and crosscode have made some great additions to the code and got it working with the Economist.com’s latest site structure. Check out crosscode’s latest version or read the comments for details.

Updated 2011/05/02: Based on crosscode’s latest version, I’ve updated the script on this site (https://github.com/nicjansma/economist-to-kindle) to work with recent Economist.com articles.

Updated 2011/07/26: Small update to work with the economist.com’s latest updates: https://github.com/nicjansma/economist-to-kindle

Updated 2012/01/04: I’ve moved this project to Github: https://github.com/nicjansma/economist-to-kindle. If you have any suggestions, find bugs, or want to contribute, please head there.

Todoist.com (and TodoistBackup.exe)

June 11th, 2009

After reading David Allen’s Getting Things Done: The Art of Stress-Free Productivity a few years back (great book!), I was inspired to change the way I managed my to-do list. For quite a while, I had been maintaining my lists of tasks, projects, and ideas in a discombobulated mess of sticky notes, whiteboards, todo.txt’s, Outlook tasks, and random files on my hard drive. I took some of the ideas from GTD and applied them in a way that worked for me. I settled on two technologies: a wiki (for notes, lists etc), and todoist.com for task management.

There are several good web sites and programs out there that aim to let you apply Allen’s GTD principles, such as Remember The Milk (RTM), Todoist, Toodledo, Outlook, Google Tasks, and even maintaining a todo.txt. One of my favorite blogs is Lifehacker, which is focused on GTD stuff, and has covered todoist.com and RTM as well as similar sites. They’ve published a book called Lifehacker: 88 Tech Tricks to Turbocharge Your Day and a sequel, Upgrade Your Life: The Lifehacker Guide to Working Smarter, Faster, Better, that are both fun reads. There’s been a lot of activity around GTD since the book was published, and many websites and apps attempt to provide a seamless way for people to manage their life in a GTD way.

After test-driving a few of the GTD web sites and programs, I settled on one that I’ve been using for 100% of my task management over the last three years: todoist.com. Todoist is a simple, Ajax’y site that allows you to organize your tasks in different projects:

Todoist.com

You can nest tasks and projects, prioritize, tag, colorize, and set absolute and recurring due dates. The interface is dead-simple and very quick. In the three years that I’ve been using the site, there have only been a few moments of downtime. It manages hundreds of my tasks (all prioritized!) with ease. I honestly think I worry less these days, knowing that all of my tasks and projects are neatly organized. Yes, I am often slightly OCD.

However, I had a big concern that Todoist, which now stores 100% of my tasks, might disappear without a trace some day. Luckily, Todoist provides a simple JSON API for retrieving all of your projects and tasks. With this API, I was able to build a simple C# app (TodoistBackup.exe) that does daily backups of my Todoist data, just in case Todoist were to ever disappear unexpectedly. (What to do with the data is another question entirely, but I’m sure I could deal with a text file until I found the time to write a replacement… :) ).

The program I created is called TodoistBackup. I am utilizing a C# library from James Newton-King called Json.NET to interface with todoist.com’s API, and save the output into XML. The program is dead simple to use: you simply specify your API “token” (basically a unique ID for each login, so you don’t have to share your username/password), and the output XML file name. For example, this command line saves my tasks to an XML file with today’s date:

TodoistBackup.exe [api token] "tasks-%date:~10,4%%date:~4,2%%date:~7,2%.xml"
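
The %date:~10,4% style substrings slice the text of cmd.exe’s %date% variable, so the offsets depend on the machine’s regional date format. As a hedged alternative, the file name could be built by a small wrapper script instead. The Python sketch below is illustrative only; backup_filename is a hypothetical helper and not part of TodoistBackup:

```python
from datetime import date


def backup_filename(day: date) -> str:
    """Build a name like tasks-20090611.xml from a date."""
    return f"tasks-{day:%Y%m%d}.xml"


# Example with a fixed date; use date.today() for the current day.
print(backup_filename(date(2009, 6, 11)))  # tasks-20090611.xml
```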

The XML looks like this:

<todoist>
    <projects>
        <project Id="1" UserId="2" Name="Fun">
            <items>
                <item Id="1" UserId="2" ProjectId="1" Content="Have fun" />
                ...
            </items>
        </project>
        ...
    </projects>
</todoist>

The conversion from JSON to an XML archival format is incredibly easy in C# using JsonSerializer and XmlSerializer attributes. Check out how nice this is:

    /// <summary>
    /// Todoist project
    /// </summary>
    [JsonObject(MemberSerialization.OptIn)]
    [XmlRootAttribute(ElementName = "project")]
    public class TodoistProject
    {
        /// <summary>
        /// Initializes a new instance of the TodoistProject class.
        /// </summary>
        public TodoistProject()
        {
            Items = new List<TodoistItem>();
        }

        /// <summary>
        /// Gets or sets the project's Id
        /// </summary>
        /// <value>Project's Id</value>
        [JsonProperty("id")]
        [XmlAttribute]
        public int Id
        {
            get;
            set;
        }

        /// <summary>
        /// Gets or sets the user's Id
        /// </summary>
        /// <value>User's Id number</value>
        [JsonProperty("user_id")]
        [XmlAttribute]
        public int UserId
        {
            get;
            set;
        }

        // ... remaining members (such as Items) omitted
    }

Very clean!

The program and source code for TodoistBackup is available on Github. If you have any suggestions, find bugs, or want to contribute, please head there.

Let me know if you use it!

The Economist, and, The Kindle

April 13th, 2009

For the last two years, I’ve been taking the bus to work — there’s a stop a block away from my house, with a direct route to the front door of my workplace.  One of the many things I’ve enjoyed about commuting via the bus is how much free time it gives me a week to read or listen to music.  I’ve always been a reader of books and magazines, but over the past few years the number of books I’ve finished has dwindled due to a lack of free time.

With the extra 5 hours of reading a week during the bus ride (a half-hour ride to and from work), I’ve been able to finish more books than I would otherwise, as well as keep current with some of my favorite magazines.  I started reading The Economist on and off during college, and I recently took the plunge and purchased a print subscription.  Which is no small commitment, as the best prices you can find for a print subscription are over $120 a year.  You can potentially get a subscription for $77 a year if you’re a student, which is an amazing deal: something that, after subscribing to the print edition for a year, I’d heartily recommend to anyone who qualifies.

I recently stumbled upon something amazing: the current week’s full print edition is available, online, for free: this week’s Economist print edition.  Reading past editions requires a paid economist.com subscription, which carries a decent yearly fee.  But this week’s edition is completely free.  I’m not going to gush any more over how much I enjoy reading The Economist, except to say that if I only subscribed to one magazine, The Economist would be it. (I also subscribe to Business Week, Wired and others).


After a lot of research, I also decided to purchase a Kindle 2.  I’m in love with it. It’s so tiny, which is wonderful for reducing the amount of weight in books and magazines I was carrying to and from work before.  I’ve been able to expand the range of books I can read on the bus because some of them (text books, Harry Potter, SciFi award compilations, etc) are just too darn thick to lug back and forth!  My Kindle has a dozen of my in-progress books on it right now.

Unfortunately, as of right now, The Economist isn’t available on the Kindle.  After scouring the ‘tubes, I came across a big customer request discussion on Amazon.com.  Looks like hundreds of people want the same thing.  There are also rumors flying around that something’s in the works from The Economist, but nothing substantial yet.

More scouring found KindleFeeder.com (free for most users), which has a feed for The Economist articles, as well as Calibre, which is a really cool e-Book manager.  Calibre will automatically download the Economist print edition, and convert it to a format for your Kindle or similar reading device.

My searches finally brought me to this post on the blog Fat Knowledge, which was exactly what I was looking for: a PHP script that downloads the print edition from economist.com and converts it to Mobi format, which the Kindle can read.  A table of contents is generated, and full text and images from articles are available.  Really cool!

Since the source was available, I also customized it a bit to my needs:

  • emails my kindle (xxx@kindle.com) the resulting .mobi, so I get Thursday morning delivery of the latest Economist (via cron)
  • reformatted the PHP a bit, to better help me understand what the script was doing

Original source code from the Fat Knowledge blog is available, as well as my updates here:

economist-to-kindle.php at Github

After reading The Economist on my Kindle for a few weeks, I do find myself going back and forth between the Kindle format and the dead-tree edition.  There’s something about the magazine format that will probably get me to renew my print subscription for another year or so.  But it’s nice to also have it on the Kindle, especially a day or two before the print edition arrives!

Note: Depending on where you will run the script, you will also need MobiGen for Linux or PC.  The PC version is available from MobiPocket’s /dev/ website.  For some reason, their Developer website no longer links to the Linux version.  I eventually found a random link that has the Linux version on the ‘tubes (though I can’t remember where), so here is mobigen_linux v6.2 build 41 if you need it.

Update 7/9/2009: Amazon announces that The Economist is (officially) available on the Kindle, albeit at the full newsstand price ($10.49/mo), which is more expensive than any other magazine available for the Kindle. Do they really expect people to pay full price for a digital edition? The Kindle edition is getting a ton of 1-star reviews because of this; hopefully this sends a message to Amazon/The Economist that they need to lower their price to be more competitive.

Update 1/3/2010: A new version of the script has been posted that uses your economist.com login credentials to get this week’s Print Edition. If you are not a subscriber to the online or print editions, you won’t be able to access the website.

b2evolution 0.9 to WordPress 2.7 Migration Script

April 1st, 2009

For the previous 5 years, nicj.net ran on a blogging platform called b2evolution, one of the hundreds of CMS (content-management systems) available.  b2evolution was pretty revolutionary for its time, and worked well as a platform in 2003.  Over the years, the version I had been running (0.9) became rather old and out-dated.  There were probably a few unpatched security holes in the version I was running, but I was getting frustrated at paying the “upgrade cost” to keep my website up-to-date with the latest releases.  Each new release took time to upgrade and invariably caused problems that I would have to spend time debugging.  Lately, there was a big problem with comment spam, which can be annoying to keep on top of.  b2evolution doesn’t have a good solution to reduce comment spam, whereas several newer blogging platforms use Akismet to help fight it.

So over the last year, I’ve been looking for a new blogging platform.  WordPress is the successor I chose.  WordPress is similar to b2evolution (both spawned from the same parent), but what impresses me the most about WordPress is how clean the interface feels and how powerful the blogging platform is with all of the community plug-ins available.  Upgrades are seamless as well (as simple as a subversion ‘update’ command).

The problem was what to do with all of the old posts on nicj.net in the b2evolution MySQL database.  Not that anyone cares what I wrote during my junior year of college, but I was hoping to at least be able to archive some of those old posts as ‘private’ so I could have a journal that I could look back to at a later date.  I was sure someone had done a b2evolution to WordPress posts/comments/users migration before, and sure enough, there are several scripts out there:

The problem is, each script has its caveats: each migrates from a specific b2evolution version to a specific WordPress version, and each has its own incompatibilities with my situation.

I finally settled on one from tumahler.com, and modified it a bit to suit my own migration needs.  Changes are:

  1. Supports b2evolution 0.9 to WordPress 2.7 migration
  2. Script can read from one b2evolution MySQL DB and write to a different WordPress MySQL DB
  3. Migrates posts marked as b2evolution private to WordPress private
  4. Sets categories for posts
  5. Only posts and comments are migrated (not users, metadata or categories)

So I’m adding my script to the mix of publicly available b2evolution -> WordPress migration scripts, in case it will help anyone:

https://github.com/nicjansma/migrate-b2evolution-to-wordpress (b2evo 0.9 to WP 2.7 tested)

Please note: Only use this on a clean WordPress install, as it will delete any WP posts, comments and metadata before the migration of b2evo data.

The Code Book

October 18th, 2006

Just finished reading The Code Book, by Simon Singh. I loved it. Great history of ciphers and cryptography through the ages.

Some favorite select quotes:

Phil Zimmermann, author of PGP:

Cryptography used to be an obscure science, of little relevance to everyday life. Historically, it always had a special role in the military and diplomatic communications. But in the Information Age, cryptography is about political power and in particular, about the power relationship between a government and its people. It is about the right to privacy, freedom of speech, freedom of political association, freedom of the press, freedom from unreasonable search and seizure, freedom to be left alone.

In the past, if the government wanted to violate the privacy of ordinary citizens, it had to expend a certain amount of effort to intercept and steam open and read paper mail, or listen to and possibly transcribe spoken telephone conversations. This is analogous to catching fish with a hook and a line, one fish at a time. Fortunately for freedom and democracy, this kind of labour-intensive monitoring is not practical on a large scale. Today, electronic mail is gradually replacing conventional paper mail, and is soon to be the norm for everyone, not the novelty it is today. Unlike paper mail, e-mail messages are just too easy to intercept and scan for interesting keywords. This can be done easily, routinely, automatically, and undetectably on a grand scale. This is analogous to driftnet fishing – making a quantitative and qualitative Orwellian difference to the health of democracy.

Whitfield Diffie, pioneer of public-key cryptography:

In the 1790s, when the Bill of Rights was ratified, any two people could have a private conversation – with a certainty no one in the world enjoys today – by walking a few meters down the road and looking to see no one was hiding in the bushes. There were no recording devices, parabolic microphones, or laser interferometers bouncing off their eyeglasses. You will note that civilization survived. Many of us regard that period as a golden age in American political culture.

Ronald L. Rivest, of RSA fame. The Case against Regulating Encryption Technology:

But it is poor policy to clamp down indiscriminately on a technology merely because some criminals might be able to use it to their advantage. For example, any U.S. citizen can freely buy a pair of gloves, even though some criminal might use them to commit a crime without leaving fingerprints. Anyone can freely buy a personal computer too, even though a burglar might use them to ransack a house without leaving fingerprints.

I rather like the glove analogy; let me expand on it a bit. Cryptography is a data protection technology just as gloves are a hand protection technology. Cryptography protects data from hackers, corporate spies and con artists, whereas gloves protect hands from cuts, scrapes, heat, cold and infection. The former can frustrate FBI wiretapping, and the latter can thwart FBI fingerprint analysis. Cryptography and gloves are both dirt-cheap and widely available. In fact, you can download good cryptographic software from the Internet for less than the price of a good pair of gloves!


Let’s talk about Wiki

May 12th, 2006

I like Wikis. What is a Wiki? Well, according to Wikipedia, it is “a website that allows users to easily add, remove, or otherwise edit all content, very quickly and easily, sometimes without the need for registration”. For example, you log onto a website, see that there’s a spelling error, hit the “Edit” button to correct it, and it’s instantly updated for everyone else. It’s a simple idea, but very powerful.

The website Wikipedia is a prime example of a Wiki: it is an online encyclopedia created by, and edited by, anyone and everyone. To date, there are over a million English articles on Wikipedia, and there are articles in hundreds of other languages as well. Anonymous internet users create and edit these articles, keeping them up to date and accurate. I use Wikipedia almost as much as Google these days when I’m doing research. See below for other sites based around the idea of a Wiki.

A wiki allows for a community of people (often the internet community as a whole) to keep an up-to-date record of their knowledge.

So anyone, anonymously, can edit a wiki. Doesn’t this make Wikipedia and similar sites prone to abuse, defacing, or inaccurate data? You bet’cha. But the great thing about a wiki is that as easy as it is to make changes, it is just as easy to revert those changes to a previous version (all old versions are saved, and there is good revision control). A wiki is only as strong as its community, but in the case of places like Wikipedia, there is a large global community of people who want to keep the content up to date, unbiased, and accurate.

There are also several other projects based around the idea of a Wiki. The Wikimedia Foundation has several free projects such as:
* Wikipedia – encyclopedia
* Wikibooks – textbooks and manuals
* Wikiquote – quotes database
* Wikisource – source documents
* Wiktionary – dictionary
* Wikinews – news
* Wikispecies – directory of species

Very cool stuff.

I’ve also begun using a Wiki as a personal organizer. I’ve converted much of My Documents into a wiki on a website that I own (that only I can read and edit). Why do this? Well, it gives me:
* A place to jot down notes on random things, like stuff that I might need to remember later. For example, notes on how to use X or Y, problems I’m working on at work (and how I solved them), how I did this or that, what I need to do later, ideas for projects, etc.
* Access from anywhere (home, work or my phone)
* Revision control (every change is saved)
* A simple interface to edit documents

I started using my wiki (I call it NiciWiki!) two weeks ago and I’m still using it quite a bit. It’s a bit slower than editing a document on my computer, but it provides the advantages above so I think it’s worth it.

There’s even a cool implementation of a wiki called TiddlyWiki that you can save on your hard drive — it doesn’t require any web server, and you can edit it any time (think storing it on a USB thumb drive in your pocket).


Fire drill!!!

April 8th, 2006

So there I am, sitting at my desk at work, and my computer explodes with the sound of BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP BEEP. Rats! I had to quickly smack the power button on my computer before my officemate got annoyed (losing all my work!).

Welcome to the world of backwards compatibility: the Bell Character. A remnant of the teletype era, inherited by MS-DOS many years ago. See, back in the days when there were just characters on the screen, no fancy-schmancy graphical interfaces, programmers wanted to be able to add a little spice to their programs. So they decided that whenever this one particular character, ASCII code 7 (^G, or what you get if you hit control-G), was displayed on the console, it would also make the computer beep. Kind of cool, right? You could alert the user to something important if you needed. Your computer doesn’t even need to have speakers attached, since all computers come with a small buzzer to make that ‘beep’ sound.

This ‘feature’ still exists today. I was searching through files on my computer (with ‘findstr’ on the command prompt), and it accidentally searched a non-text file. And this non-text file had the character 7 in it, hundreds of times. So when findstr printed it out, it decided to make a hundred loud beeping sounds, in a row. There is no way to stop this. You can try it for yourself. Hit Win-R, type ‘cmd’, hit enter, hit ctrl-G, then enter. Your computer will beep.

I don’t believe there is any way to turn it off.
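That said, if the noisy text passes through a script of your own before it hits the console, you can at least filter the bell character out first. Here is a minimal Python sketch of that idea. The function names are made up for illustration, and it does nothing for tools like findstr that print straight to the console.

```python
BELL = "\a"  # ASCII code 7, the same character as Ctrl-G


def count_bells(text: str) -> int:
    """Count how many beeps the text would trigger if printed raw."""
    return text.count(BELL)


def strip_bell(text: str) -> str:
    """Return the text with every bell character removed."""
    return text.replace(BELL, "")


if __name__ == "__main__":
    # A made-up line of search output containing three bell characters.
    noisy = "match found\a\a\a in some binary file"
    print(f"beeps avoided: {count_bells(noisy)}")
    print(strip_bell(noisy))
```

Printing the stripped string is silent, while printing the original would ring the buzzer once per bell character on a console that honors it.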


What’s your backup plan?

March 15th, 2006

This sounds very cliche, but my life is very… digital. So many parts of my life revolve around bits of information stored on disks and drives in various places. Documents, pictures, music, emails, projects — the list goes on.

We live in an age where this is very common. But have you ever thought about what would happen if you lost some of it? All of it? How would you feel? Sure, losing some of your MP3s isn’t the end of the world. But what about the other stuff? Maybe I’m an outlier on the curve for this, but there is so much data that I have created (and continue to generate daily) that I want to keep forever.

How could you lose your data? Unfortunately, it’s pretty easy:

Problems:

  1. Media failure: CDs get scratched, and may only store your data reliably for a few years. DVDs tell a similar story.
  2. Hardware failure: Hard drives crash (the average lifetime is only a few years).
  3. User failure: “oops! I hit delete!”
  4. Catastrophic events (flood, fire) (but those are a bit harder to plan for).

I’ve always backed up bits and pieces to various places, but a few weeks ago I decided I needed to take a serious look at protecting all of this information. The problem will only get worse — I will continually produce data for the rest of my life, and most of it I will want to save. My “My Documents” folder is already 40GB and growing. Other data/media is nearly 500GB.

So what’s my backup plan? Let’s keep it simple: reduce the risks of Problems #1 and #2 (media and hardware failure) and #3 (user error). Problem #4 is a bit harder, but there is an easy solution if you just want to plan for the worst-case scenario.

But first, you have to decide what to back up. I can think of three categories of data:

Categories of Data:

  1. My personal documents. This includes school work, source code, emails, financial information, etc. I want to be able to save this stuff forever. Additionally, data from the websites needs to be backed up. Losing it would be catastrophic. This is priority 1.
  2. Media. Movies, music, pictures, TV shows. If I lost this stuff, not a big worry — I just lose time re-acquiring. I can imagine that in the future this won’t even be a problem (bandwidth will be irrelevant). So not mission-critical, but helpful if I had some sort of redundancy. Priority 2.
  3. Operating systems. When an OS crashes, it usually just costs time, but that can be one of the most frustrating experiences and ruin a whole day (or three). There are two classes of machines (for me): home and servers. I don’t mind losing my home machines for a day to do an OS rebuild, but losing my web server for a day costs revenue. Priority 3 for home machines, Priority 2 for servers.

A backup plan needs to be multi-tiered. One solution won’t work for everything. So how do you protect yourself from losing data? What options are available?

Solutions:

  1. Media. CDs and DVDs.
    Advantages: Cheap (both media and burners). Portable.
    Negatives: Unreliable due to scratching and life expectancy. Easy to misplace. Low capacity (5GB for DVDs). Staleness of data (it’s a pain to do backups each week).

    This solution should work for most people if they don’t need a lot of stuff backed up.

  2. Hard Drives (External hard drive). A USB or Firewire drive that is connected when you want to back up your data.
    Advantages: Portable. High-capacity (500GB and growing). Fast.
    Negatives: Costly. Same unreliability as media (3 years?). Potential to break (dropping). Same staleness problem as media unless you have it automated.

    Probably the best solution for most people who have more data than CDs or DVDs can hold.

  3. RAID. RAID basically trades storage on one of your drives for redundancy (a backup). The best part is that the redundancy is handled automatically: if one drive crashes, the other drives have enough information that they will continue to work with no service interruptions.
    Advantages: High-capacity (multiple drives can work together). Fast (depending on implementation). High availability.
    Negatives: Costs more than drives alone (you need RAID controller cards). Not portable (hard to move from computer to computer). Lose capacity (trading space for redundancy).

    Probably overkill for many people, but excellent when you have to protect a lot of data in an environment where you need high availability.

  4. Drives on other machines. Using free space on other computers (on a network) to back up data.
    Advantages: Cheap (utilize space that isn’t used elsewhere). Fast. Can be automated to provide daily (or better) backups.
    Negatives: You need extra computers to make this work.

  5. Revision control (keeping older versions of documents in case you need to revert to an earlier version). This solution is mainly geared toward Problem #3, user error.
    Advantages: Changes can be removed.
    Negatives: Extra space is needed for each version.

    I utilize this for source code where I may be required to back out a change if I find problems.
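To make the RAID idea from item 3 a bit more concrete, here is a small Python sketch of how parity lets the surviving drives rebuild a lost one. It is a simplification under my own assumptions: the "drives" are just short byte strings, the function names are invented, and the parity sits on one dedicated drive, whereas a real RAID-5 array spreads parity across all of its drives.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover the single lost data block from the survivors plus parity."""
    return xor_blocks(surviving + [parity])


if __name__ == "__main__":
    # Three pretend data drives and one parity drive.
    data = [b"docs", b"pics", b"code"]
    parity = xor_blocks(data)

    # Pretend the second drive died; rebuild it from the other two and parity.
    recovered = rebuild_missing([data[0], data[2]], parity)
    print(recovered == data[1])
```

The trick is that XOR-ing a value in twice cancels it out, so XOR-ing the survivors with the parity leaves only the missing block. It also shows the capacity trade-off: one drive’s worth of space goes to parity instead of data.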

My Backup Plan

So what do I do?

My personal documents: Since this is Priority 1 for me, I have four levels of backup.

  1. My Documents resides on a 4-disk, 750GB RAID-5 array (4x 250GB Western Digital WD2500YD Enterprise Drives). These drives come with a 5-year warranty (most consumer drives carry 3 years). I can lose any one of the 4 drives, and all of my data will still be safe.
  2. Bi-weekly, I back up to an external hard drive. Afterward, this is placed in a fire-proof safe.
  3. I’ve burned my most essential documents onto DVD and have sent them to my parents in another state. This protects against some catastrophic events (fire), and if both my home and my parents’ home go up in flames on the same day, I have more important things to worry about (the end of the world perhaps?).
  4. Revision control for my source code and some other documents, keeping all old versions.
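In case the numbers in item 1 look odd (four 250GB drives, but only 750GB of space), the arithmetic is simple: RAID-5 gives up one drive’s worth of capacity to parity. A tiny Python sketch, with a function name I made up:

```python
def raid5_usable_gb(drive_count: int, drive_size_gb: int) -> int:
    """Usable space in a RAID-5 array: one drive's worth goes to parity."""
    if drive_count < 3:
        raise ValueError("RAID-5 needs at least three drives")
    return (drive_count - 1) * drive_size_gb


if __name__ == "__main__":
    # The array described above: 4 drives of 250GB each.
    print(raid5_usable_gb(4, 250))  # 750
```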

My web server data, which contains things such as the web pages, databases and server configuration files. Also Priority 1. This data has three levels of backup.

  1. Important data from the web server is backed up nightly with a 7-day history.
  2. This data is backed up onto a second hard drive on the same machine nightly as well. The second drive is essentially a mirror of the first drive.
  3. Weekly, this data is backed up to my home server.
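For the curious, a nightly backup with a 7-day history (item 1) can be sketched in a few lines of Python. This is not the script I actually run, just an illustration under some assumptions: each night gets its own folder named after the date, and the function names and folder layout are hypothetical.

```python
import shutil
from datetime import date, timedelta
from pathlib import Path

KEEP_DAYS = 7  # how many nightly copies to keep


def prune_old(backup_root: Path, today: date) -> list[str]:
    """Delete dated backup folders that fall outside the KEEP_DAYS window."""
    cutoff = today - timedelta(days=KEEP_DAYS)
    removed = []
    for folder in sorted(backup_root.iterdir()):
        try:
            stamp = date.fromisoformat(folder.name)
        except ValueError:
            continue  # not one of our dated folders, leave it alone
        if folder.is_dir() and stamp <= cutoff:
            shutil.rmtree(folder)
            removed.append(folder.name)
    return removed


def nightly_backup(source: Path, backup_root: Path, today: date) -> Path:
    """Copy source into a folder named after today's date, then prune."""
    target = backup_root / today.isoformat()
    shutil.copytree(source, target, dirs_exist_ok=True)
    prune_old(backup_root, today)
    return target
```

A scheduled job would call something like nightly_backup(Path("/path/to/site"), Path("/path/to/backups"), date.today()) once a night, with both paths replaced by real ones. Full copies are the simplest thing that works; they cost seven times the disk space, which is fine for small data but not for the 500GB of media.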

My media, which I would probably go “eh” if I lost it, but I’m utilizing free space from the above solutions to provide backup.

  1. I’ve copied much of the data I had on CD and DVD onto the RAID-5 array (about 300GB total).
  2. The rest is still stored on CDs and DVDs where I won’t touch them unless need-be.

My machine OSs. I have a home machine, a home server (which does backup jobs and TiVo stuff), and my web server (which powers this and other websites).

  1. My home machine’s OS is not backed up. I don’t mind having to reinstall Windows if need-be. If I was worried, I could switch to RAID-1.
  2. My home server backs up important configuration files to another drive nightly.
  3. My web server has two drives. The 2nd drive does a nightly mirror of the first drive, so it can replace the first drive with minimal downtime.
  4. My home server also downloads the web server’s important data weekly to its own backup drive.

The home machines are all protected by UPS units. A UPS is very important for my home computer, which has 6 hard drives spinning constantly. When you lose power, some disks might have trouble “slowing down” and could incur errors. The UPS connects to the home machine and provides ~5 minutes of backup power (complete with the LCD monitor and networking if need-be), then automatically shuts Windows down gracefully.

Oh, I also back up my parents’ documents over the internet weekly. I don’t think they care all too much, but I’d like to make sure they’re safe, regardless.

So how could I lose it all? I’m sure there are flaws in my plan, but I’m protected on many levels against multiple problems. Nothing is perfect, but I do feel a lot safer knowing my data is safe.

Phew.

What do you do?
