[Mar. 30th, 2009|10:16 pm]
So my boss is interested in some stats on our data archive.
I have access to a SQL database that has details of every single download transaction. So I can definitely pull some stats together. The question is... what?
I mean, I can slice and dice the data in a million different ways. I can count the number of users, total data volume downloaded, popular files... what else would be interesting?
Jerry suggested looking at the number of file downloads per user, which might show some interesting patterns.
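As a rough sketch of Jerry's suggestion: the snippet below counts downloads per user and then how many users fall into each count. The `downloads` table and its columns (`user_id`, `file_id`, `bytes`) are made up for illustration, since the real schema isn't described here, and the in-memory SQLite database just stands in for the real one.

```python
import sqlite3
from collections import Counter


def downloads_per_user(conn):
    """Return {user_id: number_of_downloads} from a hypothetical downloads table."""
    rows = conn.execute(
        "SELECT user_id, COUNT(*) FROM downloads GROUP BY user_id"
    ).fetchall()
    return dict(rows)


def histogram(per_user):
    """Return {download_count: number_of_users_with_that_count}."""
    return dict(Counter(per_user.values()))


def demo_connection():
    """Build a tiny in-memory database with made-up rows (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE downloads (user_id TEXT, file_id TEXT, bytes INTEGER)"
    )
    conn.executemany(
        "INSERT INTO downloads VALUES (?, ?, ?)",
        [
            ("alice", "a.dat", 100),
            ("alice", "b.dat", 200),
            ("alice", "c.dat", 50),
            ("bob", "a.dat", 100),
            ("carol", "b.dat", 200),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    per_user = downloads_per_user(conn)
    print(per_user)
    print(histogram(per_user))
```

A long tail in the histogram (a few users with a huge number of downloads) would be one of the "interesting patterns" worth showing.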
The downloaders' IP addresses in alphabetical order?
The number of downloads from whitehouse.gov?
The total budget for the agency divided by the total number of downloads?
Your server's rating in Hurtling Station Wagon Full Of Eight Track Tapes equivalent units.
I think it's actually A Lot.
Not only is that a measure I need to note myself for future use (thanks)...
but B's background-color plus your icon plus the angle of my mac laptop screen combine to make a really stunning translucent image. I realize it is prolly bog-standard Mail.app bits there, but you mind if I steal the idea with attribution? :)
No attribution needed, I just spent a few minutes futzing around to pull it out of an icon resource and into something postable.
Along the lines of popular files: look for ones that are popular over time versus in short bursts, and ones that are popular with a wide audience versus ones that a few people seem to revisit.
What is the geographical distribution of users?
Are there files that cluster together in time (especially thematically unrelated ones)?
Can you get to the user download patterns (e.g. searching versus browsing versus surveying)?
Are there time patterns of high and low activity: daily, annually, over the life of the archive?
How soon after adding new content does someone access it? How long does it take to reach a "steady state" interest level?
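The "how soon after adding new content" question, for instance, could be sketched roughly like this. It assumes you can get an added-date per file and a timestamp per download; the function name and input shapes are hypothetical, not taken from the actual database.

```python
from datetime import datetime


def first_access_delays(added, downloads):
    """Time from a file being added to its first download.

    added:     {file_id: datetime the file was added}   (assumed input shape)
    downloads: iterable of (file_id, datetime of download)
    Returns {file_id: timedelta} for files downloaded at least once.
    """
    first = {}
    for file_id, ts in downloads:
        # Skip files we have no added-date for, and downloads logged before it.
        if file_id not in added or ts < added[file_id]:
            continue
        if file_id not in first or ts < first[file_id]:
            first[file_id] = ts
    return {file_id: first[file_id] - added[file_id] for file_id in first}


if __name__ == "__main__":
    added = {"a.dat": datetime(2009, 3, 1), "b.dat": datetime(2009, 3, 5)}
    downloads = [
        ("a.dat", datetime(2009, 3, 3)),
        ("a.dat", datetime(2009, 3, 2)),
    ]
    for file_id, delay in first_access_delays(added, downloads).items():
        print(file_id, delay)
```

Files that never show up in the result (like `b.dat` here) are their own statistic: content nobody has touched yet.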
Most of the time it is easier to figure out the stats if you are trying to answer some questions, so what might those be? If you are interested in stats based on users, try to think of all the ways a user might want to use the data and look at the database through those lenses. Pick the ones that seem likely to give you some meaningful interpretations.
Colored folders are always nice.
Why does your boss want stats?
If it's to say "Gosh, what a great data archive we have" when he goes looking for more funding, I'd look for volume stats like users and bytes and things like that, things that would make people go "Ooooh, ahhh" (like the Eight Track number).
If it's to figure out how to run the data archive more efficiently, then I'd go with some of the per-user numbers in the comments.
But first figure out the question being asked, then you'll have a better idea of what answers you should be providing.
Mostly it's for more funding, so yeah, "oooh ahhh" factor. Good!
Distribution of bandwidth use by distinct user?
Correlation? (Of users who looked at foo, 10% looked at bar..)
Hey, those are good ideas.
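The "of users who looked at foo, 10% looked at bar" idea could be sketched along these lines. Strictly speaking this is a conditional proportion rather than a correlation coefficient, and the `(user, file_id)` row shape is an assumption about what the transaction log can give you.

```python
from collections import defaultdict


def co_download_rate(downloads, foo, bar):
    """Of the users who downloaded `foo`, what fraction also downloaded `bar`?

    downloads: iterable of (user, file_id) pairs (assumed input shape).
    Returns a float in [0, 1], or None if nobody downloaded `foo`.
    """
    files_by_user = defaultdict(set)
    for user, file_id in downloads:
        files_by_user[user].add(file_id)

    foo_users = [u for u, files in files_by_user.items() if foo in files]
    if not foo_users:
        return None
    both = sum(1 for u in foo_users if bar in files_by_user[u])
    return both / len(foo_users)


if __name__ == "__main__":
    rows = [
        ("alice", "foo"), ("alice", "bar"),
        ("bob", "foo"),
        ("carol", "bar"),
    ]
    print(co_download_rate(rows, "foo", "bar"))
```

Running it over every pair of popular files would give a rough "people who grabbed this also grabbed that" table.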
I am probably too late on this, but some offhand thoughts:
You want your data aggregation to construct and/or support a narrative. Perception is better when there is a narrative with some dramatic tension. So you need to decide how the people who will want to give your group funding will want to see things, and what dramatic tension they're interested in.
The tension they're interested in is probably some kind of buzzword or popular political movement that makes them feel like you're working on exactly what they've been talking about.
You group users and your data into "characters" which interact with each other and the dramatic tension in ways that advance the narrative.
Here's an arbitrary and fictional example:
Ten years ago there was a gap in atmospheric modeling which failed to factor in some technologies which were emerging in transportation and the energy grid. This caused people to be uncertain of how much or how little adoption or advancement of these technologies would benefit things in the long term, even though they seemed promising in the short term.
So people who work for or think about utilities and transportation began seeking the kind of data that would allow their models to include these kinds of changes. Thus, five years ago you see a spike in these people doing research. Over time, they began to refine their queries to a specific area of data which turned out to be incredibly pertinent, and allowed them to construct the models which directed funding and interest into long-term solutions and abandoned those which only appeared to have an effect on the problem.
Recently, you can see a similar trend with data regarding clean coal, and similar patterns are emerging, which we can expect to be repeated as long as there are new technologies for which measuring the long-term impact requires large amounts of data which can be analyzed in specific ways--exactly what you provide.
That kind of thing. But what you actually provide is a cross-section of the interests and/or employers of users combined with the data sets they tend to use. If, you know, that's the narrative and tension you've come up with.
Dear god that was longer than I expected. Sorry. Hope it's at least vaguely useful.