Saturday, May 5, 2007

Census 2000 Summary File 3 now online, Summary File 1 completed

With the new server online (see previous post), I finally have the disk space I need to really start building out the available data sets - I'm going from about 55GB on my old machine to just about 1100GB on the new one. (Well, technically, the "old" machine, my desktop, is newer than the "new" one, but since it was the first gCensus server, let's keep calling it "old".)

The first beneficiary of this storage largesse has been Summary File 1. I've expanded the coverage from California, Oregon, and Pennsylvania to cover all 50 states plus the District of Columbia.

The second major step was to add Summary File 3, which covers a lot of interesting economic and housing statistics such as median and aggregate income, housing prices, and housing facilities. Fortunately, the file structure between SF1 and SF3 is very similar, so I was able to re-use most of the import code that I had already written for SF1. The coverage for SF3 is the same as for SF1 - all 50 states + DC.

It turns out in the end that my estimates of disk consumption were off - way off. I had originally believed that it would take 750-1000GB to store all of Summary File 1. Instead, it's taking about 410GB to store both SF1 and SF3. While my friends in the theory group would call that "a small constant factor", I prefer to think of it as "a whole lot of space". Consequently, if anyone has ideas for large (nationwide or even worldwide) data sets that would be cool to import, I'd like to hear about them.

New gCensus server online!

I have a big update here, so I'm breaking it into several pieces. The first part - the new gCensus hardware is finally up and running! I got the replacement motherboard from Intel and, luckily enough, everything actually came up on the first try.

Thanks to the generous donation by Ken Schmidt of Steel in the Air, I now have a fourth 400GB hard drive in the gCensus server, for a total of 1.2TB RAID5 storage. That brings the current specs of the machine up to the following:

- Intel D955XBK motherboard
- Intel Pentium EE 955 CPU (dual-core 3.46GHz with Hyperthreading)
- Zalman CNPS9500 cooler
- 4x400GB Seagate HDD (3xSATA, 1xPATA)
- PC Power and Cooling Turbo-Cool 510ATX-SLI power supply
- Diamond Stealth 64 VRAM graphics

Everything except the PATA hard drive and the video card was a donation. I'd like to thank everyone who's helped me out with this equipment - I couldn't have done it without you!
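
For anyone checking the storage math: RAID 5 spends one drive's worth of space on parity, so usable capacity is (number of drives - 1) x drive size. Here is a minimal sketch of that arithmetic; the function name is made up for illustration, and it assumes all drives are the same size.

```python
def raid5_usable_gb(n_drives, drive_gb):
    """Usable capacity of a RAID 5 array built from equal-sized drives.

    One drive's worth of space goes to parity, so the usable space is
    (n_drives - 1) * drive_gb. RAID 5 needs at least three drives.
    """
    if n_drives < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return (n_drives - 1) * drive_gb


if __name__ == "__main__":
    # Four 400GB drives, as in the spec list above.
    print(raid5_usable_gb(4, 400))
```

The figure is in raw drive-vendor gigabytes; the space the filesystem reports will come out somewhat lower.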

Thursday, April 5, 2007

Documentation now online

I've finally written up the first part of the gCensus documentation - the part detailing the database backend code. It's up in both PostScript and PDF format.

Sunday, April 1, 2007

Hardware diagnosis

After doing some testing, it looks like the power supply and motherboard died - which one triggered which is unclear. As far as I can tell, the other components (CPU, RAM, drives, video) all still work fine.

Loyd very graciously donated a replacement power supply that seems more than up to the task, and I've returned the motherboard to Intel for warranty replacement. With any luck, the replacement will be around in two weeks and I'll be able to get the new machine running again.

Tuesday, March 27, 2007

Hardware failures and gCensus GT downtime

At about 5PM today, my test server went up in smoke. I'm still trying to figure out exactly what failed, but in the meantime, gCensus-GT (which was hosted on that machine) will be unavailable.

Monday, March 26, 2007

gCensus-GT - Google Earth visualization for GeoTIFF files and more

Lisa Jordan, a geography professor at Florida State, recently pointed out to me that there's a wealth of GIS data out there in raster data formats that can't be viewed in the free Google Earth client. For example, Columbia University's Socioeconomic Data and Applications Center (SEDAC) generates raster imagery for a variety of different data trends. Although it's possible to get higher-resolution vector data (like the main gCensus app does), these files tend to be faster to render and provide a quick overview of relevant trends.

To fill this gap, I've just put out the gCensus-GT application, which allows you to visualize these raster formats in the free version of Google Earth. In addition to the GeoTIFF format SEDAC generates, gCensus-GT should be able to render any raster format supported by the open-source GDAL library (but GeoTIFF is the only one I've had test data for). Comments are, as always, welcome.

If you're interested in doing this conversion on your home machine, I've released the core code behind it as a Python module named gdaltokmz. The module has a few dependencies (notably, the GDAL Python bindings and the ImageMagick graphics tools) and is licensed under the GPL.
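
To give a rough idea of what such a conversion involves, here is a minimal sketch of the packaging step only. It is not the gdaltokmz code: the function names are hypothetical, it uses nothing but the Python standard library, and it assumes you already have the raster rendered to a PNG plus its GDAL-style six-element geotransform in latitude/longitude. It works out the image's bounding box and wraps the image in a KMZ with a KML GroundOverlay.

```python
import io
import zipfile
from xml.sax.saxutils import escape

# KML document with a single GroundOverlay that drapes an image over a
# latitude/longitude box.
KML_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <GroundOverlay>
    <name>{name}</name>
    <Icon><href>{image_name}</href></Icon>
    <LatLonBox>
      <north>{north}</north>
      <south>{south}</south>
      <east>{east}</east>
      <west>{west}</west>
    </LatLonBox>
  </GroundOverlay>
</kml>
"""


def bounds_from_geotransform(gt, width, height):
    """Return (north, south, east, west) for a north-up raster.

    gt follows GDAL's geotransform layout: (x_origin, pixel_width,
    row_rotation, y_origin, col_rotation, pixel_height), where
    pixel_height is negative for north-up images.
    """
    x_origin, pixel_w, rot_a, y_origin, rot_b, pixel_h = gt
    if rot_a != 0 or rot_b != 0:
        raise ValueError("rotated rasters are not handled in this sketch")
    west = x_origin
    east = x_origin + width * pixel_w
    north = y_origin
    south = y_origin + height * pixel_h
    return north, south, east, west


def build_kmz(name, image_bytes, gt, width, height, image_name="overlay.png"):
    """Return the bytes of a KMZ holding doc.kml and the overlay image."""
    north, south, east, west = bounds_from_geotransform(gt, width, height)
    kml = KML_TEMPLATE.format(
        name=escape(name), image_name=escape(image_name),
        north=north, south=south, east=east, west=west,
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as kmz:
        kmz.writestr("doc.kml", kml)           # main KML document
        kmz.writestr(image_name, image_bytes)  # image referenced by <href>
    return buf.getvalue()
```

Reading the raster and rendering it to an image is where the GDAL bindings and ImageMagick would come in; that part is left out here.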

Sunday, March 25, 2007

gCensus Beta version, with new features

Since feature requests are coming in, I've decided to maintain a parallel version of the gCensus app where I develop new features. You can find it here:

So far, the only new feature I've added in the beta client is the ability to map multiple regions in the same KML file. This lets you (for example) compare multiple regions while having the same bins apply to all of them, which you couldn't do before. There are some rough edges, like the unbounded growth of the top (status) pane, and incorrect titling in the KML file, but you're welcome to try it out.
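
To illustrate what "the same bins apply to all of them" means, here is a small sketch. It is not the gCensus code: it assumes simple equal-interval bins, and the function names and sample numbers are made up. The point is that the bin edges are computed once from the pooled values of every region, so a given color class means the same thing everywhere.

```python
from bisect import bisect_right


def shared_bin_edges(regions, n_bins=5):
    """Equal-interval bin edges computed over ALL regions' values at once.

    regions maps a region name to a dict of {area_id: value}.
    Returns the n_bins - 1 interior edges.
    """
    values = [v for region in regions.values() for v in region.values()]
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(1, n_bins)]


def classify(regions, edges):
    """Assign every area a bin index (0 = lowest) using the shared edges."""
    return {
        name: {area: bisect_right(edges, v) for area, v in region.items()}
        for name, region in regions.items()
    }


if __name__ == "__main__":
    # Made-up values for two regions mapped into the same file.
    regions = {
        "Region A": {"tract 1": 0, "tract 2": 50},
        "Region B": {"tract 3": 100},
    }
    edges = shared_bin_edges(regions, n_bins=4)
    print(edges)
    print(classify(regions, edges))
```

Binning each region separately would instead stretch the scale to fit each region's own range, which makes side-by-side comparison misleading.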

Keep in mind that this is the dev site, so at any given time it might not work quite right. Of course, if you find bugs or have feature requests, send them along, as always.

Friday, March 23, 2007

First post!

After hitting ExtremeTech, Slashdot, Digg, and a *ton* of other blogs in the past week, I've received quite an outpouring of interest about gCensus, so I figured I'd set up this blog to try to keep interested folk informed about the development work on gCensus.

Loyd dropped off his old PC to me last week, and I've been working on setting it up to be the new gCensus server. Right now I'm stalled, waiting for a new 400GB hard drive to arrive in the mail so that I can have a matched RAID 5 set in the machine - 800GB of storage.

Data Sources
Once I get the new gCensus server online, I'm going to be aggressively adding more states' Summary File 1 data into the database. There are a couple of problems here that I'd like to resolve - for example, the ESRI shapefiles at the block group and tract levels have some corrupt metadata that makes it impossible to identify certain areas correctly.

I'm also looking at adding the Summary File 3 (income, etc.) data. Since the data file format is very similar to Summary File 1, this might be doable without too much trouble - I still need to look at it more closely.

I've had generous offers to help with adding new data sets. I'm still trying to work out good projects for some of those who have offered, but one project I'm excited about is a tool that we're hoping will let gCensus import data from common GIS formats, so that I wouldn't have to write a one-off import script for every new format that comes in. One of the problems is that I'm not a GIS specialist (yet, anyway), so I'm not very familiar with the popular formats out there - if someone reading this can offer advice, that'd be great.

I'm interested in setting up some new capabilities on the frontend - like multidimensional data plotting, and some alternative means of visualization. I'd also love to be able to do a cleanup of the user interface, as it could stand to be prettier and implement backtracking in a useful manner. If you're a Web pro and want to help out with a public-service open-source project, let me know!

That's a pretty good summary of what's going on right now. gCensus is definitely in the growth phase - trying to set up some new collaborations, fork off a bunch of projects - so the near future should be exciting!

(P.S. - If anyone reading this is interested in supporting the project with hardware donations, my biggest needs are in the storage department. New hard drives, or a proper RAID controller that can dynamically grow arrays like the Areca 1210 or 1220, would be awesome. I don't really expect one of these to fall out of the sky, since they're not cheap... but that's the sort of thing that would be great for handling multiple terabytes of data.)