2012-12-11 »
Let's think about small data for a change
There's a lot of buzz lately about "big data" - huge, Internet scale databases that take a long time to query, but are awesome, I guess, because of how much testosterone you need to have in order to possess one.
Let's leave aside the question of exactly how big big data needs to be. I've heard of people talking about databases as small as a gigabyte as being "big," I guess because downloading it would take a few minutes, and maybe 'grep' isn't the best way to query it. Other people would say a terabyte is big, or a petabyte.
I don't really care. All I know is, beyond a certain threshold that depends on the current state of computer technology, as soon as your data is "big," queries are slow. Maybe a few seconds, which isn't so bad, but maybe a few minutes. And when things get slow, you start having to mess with having separate "data warehouse" servers so that doing your big analytical queries don't bring down your whole database. And managing it all becomes a full time job for someone or many people or a whole company.
I happen to work for an employer that does that sort of thing a lot. And to be honest, I find it pretty boring. Perhaps I have a testosterone deficiency. It's not so much the bigness that bothers me: it's the waiting. I like my compile-test-debug cycle to be on the order of two seconds, but when SQL or mapreduce gets involved, it's more like two minutes.1
I know, cry for me, right? Two minutes of my life, all gone. But seriously, when you're trying to find trends and aren't quite sure what you're looking for and it takes a dozen tries, those two minutes can add up rapidly, especially when added to the 10 minutes of web browsing or email that inevitably ensues once I get bored waiting for the two minutes.
After worrying about this problem for a long time (years now, I guess), I think I've come up with a decent workaround. The trick is to divide your queries into multiple stages. At each stage, we reduce the total amount of data by a few orders of magnitude, and thus greatly decrease the cost of debugging a complex query.
Originally, I might have tried to make a single SQL query that goes from, say, a terabyte of data, down to 10 rows, and made a bar chart. Which 10 rows? Well, it takes a few tries to figure that out, or maybe a few hundred tries. But it's much faster if we set the initial goal instead to, say, only pull out the gigabyte I need from that terabyte. And then I can query a gigabyte instead. From there, I can reduce it to a megabyte, perhaps, which is easy enough to process in RAM without any kind of index or complexity or optimization.
That last part is what I want to talk to you about, because I've been working on a tool to do it. I call it afterquery, tagline: "The real fun starts after the Serious Analysts have gone home." Serious people write mapreduces and SQL queries. Serious people hire statisticians. Serious people have so much data that asking questions about the data requires a coffee break, but they get paid so much they don't have to care.
Afterquery is the opposite of Serious. It downloads the whole dataset into RAM on your web browser and processes it in javascript.
But what it lacks in seriousness, it makes up for in quick turnaround. Here's what it produces from 1582 rows of data I got from Statistics Canada:2
Those 1582 rows are the breakdown of Canada's 2011 GDP by province and NAICS (industry type) code. I used Afterquery to sum up the values across all provinces and make a pie chart by NAICS category group. Could I have just queried Statscan and gotten those summarized results in, say, 15 rows instead of 1582? Sure. But their web site is a bit clunky, and I have to redownload a CSV file every time, and it doesn't produce pretty graphs. 1582 rows is no problem to grunge through in javascript nowadays. It even works fine on my phone.
The real power, though, is that you can take the same data and present it in many ways. Here's another chart using exactly the same input data:
And here's a treemap chart. Treemaps are generally pretty hard to produce, because of the weird way you have to format the data. But afterquery helps reformat "normal" data into helpful treemap-compatible data. I really like this visualization; see if you can find who spends the most on mining, or who spends proportionally more on agriculture.
(Tip: click on one of the boxes in the treemap for a more detailed view. You can use that to find out what specific kinds of mining. Or click 'Edit' in the top-right of the chart box to customize the chart. For example, remove "GEO," from the treegroup to see overall results for Canada rather than by province.)
Some neat things about afterquery
- It's completely stateless, configured entirely using URL parameters.
- It just grabs a file using jsonp and applies the given transforms.
- It can show either data tables or charts, including gviz charts, timelines, and dygraphs.
- You can do multiple layers of "group by" operations very quickly.
How many times have you wanted to, say, produce two columns: the total
number of purchases, and the number of customers making purchases? To
do that in SQL you'd use subqueries, perhaps, like this:
select region, count(customer), sum(nsales), sum(value) from ( select region, customer, count(*) nsales, sum(value) from sales group by region, customer ) group by region
With afterquery you just write two successive "group" transformations:&group=region,customer;count(*),sum(value) &group=region
and it does the rest. - Anything that can be "group by" can be treegrouped, allowing you to easily draw a treemap chart.
- Because we don't care about performance optimization - optimization is for Serious Analysts! - we can make tradeoffs for consistency instead. So you can apply transformations, filters, limits, and groupings in whatever order you want (instead of SQL's really strict sequencing requirements). And 'order by' operations aren't scrambled by doing 'group by', and we can have handy aggregate functions like first(x) (the first matching x in each group), last(x) (last matching x in each group), and cat(x) (concatenation of all the x in each group). You can imagine that 'order by x' followed by 'first(x)' could be a very useful combination.
- A simple "pivot" operation that lets you turn multi-level groups into column headers, allowing you to easily produce interesting multi-line or multi-bar charts, or seriously dense (but useful) grids of numbers.
- Because it's stateless, every afterquery chart is user-editable; just click the 'Edit' link in the upper-right of any chart or table, adjust the transformations, and see the new result immediately. Then cut-and-paste the URL to someone in IM or email, or drop it in an iframe on a web page, and you're done.
- Because it runs in your web browser and uses jsonp, it uses your existing browser login credentials and cookies to get access to your data.3 And because it's purely client side javascript, your data never gets uploaded to a server.
Are we having fun yet? You can find afterquery on github or on Google Code. For documentation, look at its built-in help page.
Contributions or suggestions are welcome. You can join the brand new mailing list: afterquery on googlegroups.com.
Footnotes
1 Of course if you're really processing a lot of data, a data warehouse query can be much more than two minutes, but generally it's possible to find a subset to debug your query with that takes less time. Once the debugging is done, you can just run the big query and it doesn't really matter how long it takes.
2 Those 1582 rows are the output of Serious Analysts down at Statistics Canada, who have processed all the numbers from all the tax returns from all the companies in Canada, every year for decades. That's Big Data, all right.
3 Of course, using jsonp creates security concerns of its own. If I have time, I plan to write about those in another post because I believe it's possible to work around those problems and have the best of all worlds... but it's a little tricky. Meanwhile, grep the afterquery source for "oauth" :)
Why would you follow me on twitter? Use RSS.