Big data science visualizations have evolved from the use of proprietary data (past), to difficult-to-obtain-and-use big data (present), to the hope that business, finance, media, and government big data will be more readily available and useable in the future.

That future appears a ways off, however, given my experience with several recent projects and judging from some of the presentations at the just-concluded O’Reilly Strata Conference.

The Strata Conference is billed as the home of data science; it brings together practitioners, researchers, IT leaders, and entrepreneurs to discuss big data, Hadoop, analytics, visualization, and data markets.

I was especially interested in learning more about public domain data sets and how far they’ve evolved in their ability to be used.

One way to look at that evolution is through an analysis of content from the past three Strata conferences, looking at the number of presentations by category. I was motivated to do this in part by a story I had read reporting that government, real estate, and manufacturing had the highest value potential for big data in the future.

I found this year’s conference had expanded dramatically, with the total number of presentations having essentially doubled in the past year, reflecting the surge in interest in the conference theme, “Putting Data to Work.” Data science sessions, for instance, nearly doubled, to 26 from the 15 sessions on data at the prior conference.

Here’s a quick snapshot for comparison in the number and type of presentations:

Third Conference (February 2012) – 141 total presentations:

  • Data Science (26)
  • Business & Industry (14)
  • Visualization & Interface (13)
  • Keynotes (16)
  • Hadoop & Big Data: Applied (9)
  • Domain Data (6)
  • Deep Data (9)
  • JumpStart (13)
  • Hadoop & Big Data: Tech (11)
  • Policy and Privacy (4)
  • Sponsored Sessions (20)

Second Conference (September 2011) – 62 total presentations:

  • Business (9)
  • Data (15)
  • In Practice (1)
  • Interface (6)
  • Keynote (15)
  • Sponsored Sessions (11)
  • Policy & Ethics (3)
  • Real-time (2)

First Conference (February 2011) – 70 total presentations:

  • Executive Summit (10)
  • Business, Data, and Interfaces (1)
  • Disruption & Opportunity (7)
  • Interfaces (10)
  • Practitioner (22)
  • Real World (7)
  • The Data Business (3)
  • Other (10)
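
The totals and growth figures quoted above can be checked with a short tally. What follows is a minimal sketch in Python, with the session counts transcribed from the three lists; the names `strata` and `totals` are illustrative only and do not come from any conference data feed.

```python
# Session counts per category, transcribed by hand from the lists above.
strata = {
    "February 2012": {
        "Data Science": 26, "Business & Industry": 14,
        "Visualization & Interface": 13, "Keynotes": 16,
        "Hadoop & Big Data: Applied": 9, "Domain Data": 6,
        "Deep Data": 9, "JumpStart": 13,
        "Hadoop & Big Data: Tech": 11, "Policy and Privacy": 4,
        "Sponsored Sessions": 20,
    },
    "September 2011": {
        "Business": 9, "Data": 15, "In Practice": 1, "Interface": 6,
        "Keynote": 15, "Sponsored Sessions": 11, "Policy & Ethics": 3,
        "Real-time": 2,
    },
    "February 2011": {
        "Executive Summit": 10, "Business, Data, and Interfaces": 1,
        "Disruption & Opportunity": 7, "Interfaces": 10,
        "Practitioner": 22, "Real World": 7, "The Data Business": 3,
        "Other": 10,
    },
}

# Total presentations per conference.
totals = {name: sum(cats.values()) for name, cats in strata.items()}

# Year-over-year growth: February 2012 against February 2011.
yearly_growth = totals["February 2012"] / totals["February 2011"]

# Data-focused sessions: 26 in February 2012 against 15 in September 2011.
data_growth = (strata["February 2012"]["Data Science"]
               / strata["September 2011"]["Data"])

for name, total in totals.items():
    print(f"{name}: {total} presentations")
print(f"Year-over-year growth: {yearly_growth:.2f}x")
print(f"Data session growth: {data_growth:.2f}x")
```

Because the track names changed from one conference to the next, comparing "Data Science" with the earlier "Data" track is itself an assumption about which categories correspond.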

However, the most recent conference was still mostly about methodology and technology and the promise of applications, except for the $3 million Heritage Health Data Prize.

The next conference will be held in connection with Hadoop World to reflect the considerable interest in Hadoop & Big Data and will focus on the big data issues that are shaping the business, finance, media, and government worlds.

Briefly, Hadoop is an emerging framework for Web 2.0 and enterprise businesses that are dealing with data deluge challenges: the need to store, process, and analyze large amounts of data as part of their business requirements. For example, IBM used the software as the engine for its Watson computer, which competed with the champions of the TV game show Jeopardy.

Hadoop includes an implementation of MapReduce, the programming model on which Google built its empire; it is explained in more detail for the “C-Suite” elsewhere.
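
To make the MapReduce idea concrete, here is a minimal single-machine sketch in Python. It is not Hadoop's API and it distributes no work across machines; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. It shows the three steps of the model on the classic word-count problem: map each record to key/value pairs, group the pairs by key, and reduce each group to a single result.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence, all in memory."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big maps"]))
```

In a real cluster, the map and reduce steps would run in parallel on many machines and the shuffle would move data across the network; the sketch only preserves the shape of the computation.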

Many experts believe Hadoop will subsume many of the data warehousing tasks presently done by traditional relational systems. This will be a huge shift in how IT apps are engineered.

But if my experience with other big data sets is any indication, we still have a way to go.

Another example of big data sets can be found in “Places & Spaces: Mapping Science Exhibit,” which started in 2005 and is in its 8th iteration this year with “Science Maps for Kids.”

I entered the 7th iteration (2011) “Science Maps as Visual Interfaces to Digital Libraries” last year because the subject interested me as a data scientist and digital library builder.

It is an impressive effort that has led to the book Atlas of Science, but I found that most of the maps were based on proprietary data, so it is essentially impossible to get those “big data” sets to work in your own visualization tools.

This is certainly not what the Panton Principles for Open Data in Science and the White House Open Government Data Initiative are about.

The Best Data Visualization Projects of 2011 is another example; it caught my eye, and I looked to see whether any of those data sets were in the public domain.

Unfortunately, those best data visualization projects were also mostly based on very big, difficult-to-obtain data.

So I’ve concluded that while big data science visualizations have evolved from the use of proprietary data (past) to difficult-to-obtain-and-use big data (present), we still have quite a ways to go.

My recommendation, as I have discussed in an article elsewhere: Big data sets should be “parsed and served up small.”
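
One possible reading of “parsed and served up small” is sketched below in Python, offered as an illustration under stated assumptions rather than as the method from that other article. It assumes the big data set arrives as CSV text, parses it into records, and hands the records out in small fixed-size pieces that an ordinary visualization tool could load. The names `serve_small` and `parse_and_serve` and the sample columns are hypothetical.

```python
import csv
import io
from itertools import islice


def serve_small(rows, chunk_size):
    """Yield successive lists of at most chunk_size rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def parse_and_serve(csv_text, chunk_size):
    """Parse CSV text into dict records, then serve them in small chunks."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return serve_small(reader, chunk_size)


if __name__ == "__main__":
    # Made-up sample data, used only to show the shape of the output.
    sample = "id,value\n1,a\n2,b\n3,c\n"
    for number, chunk in enumerate(parse_and_serve(sample, 2), start=1):
        print(number, chunk)
```

Because `serve_small` pulls rows lazily, only one chunk is held in memory at a time, which is the property that matters when the source is too large to open whole.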