At a recent big data conference I realized that volume and value really matter if you are going to work with big data effectively.
I asked how big 64 Terabytes of memory is in layman's terms and was told the following:
  • If listening to CDs, a person would have to listen for 14 years straight to hear 64 TB of music.
  • 64 TB is 4 times the amount of memory that the IBM “Watson” was able to use when playing Jeopardy.
  • A spreadsheet could hold a one-line entry for every person in the US (about 300 million people), with roughly 213,000 bytes of information stored for each person. Note: a typical email is only about 500 bytes, so more than 400 of these email equivalents could be stored for each person: name, address, multiple phone numbers, old addresses, Social Security number, all their grades in school, all credit card information, and lots more. (A quick arithmetic check of these figures follows this list.)
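To make those figures concrete, here is a quick back-of-the-envelope check of the spreadsheet example in Python. The 64 TB total, 300 million people, and 500-byte email are the numbers quoted above; treat this as a rough sketch, not a precise sizing.

```python
# Rough back-of-the-envelope check of the 64 TB spreadsheet example.
TOTAL_BYTES = 64 * 10**12      # 64 TB (decimal terabytes)
US_POPULATION = 300_000_000    # about 300 million people
EMAIL_BYTES = 500              # a typical short email, per the figure above

bytes_per_person = TOTAL_BYTES / US_POPULATION
emails_per_person = bytes_per_person / EMAIL_BYTES

print(f"Bytes per person:  {bytes_per_person:,.0f}")              # ~213,333
print(f"Email equivalents per person: {emails_per_person:,.0f}")  # ~427
```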
Why is 64 Terabytes of memory important? If you are really going to work with big data effectively, it seems there are two ways to go:
  • Distribute it out across many computers (e.g. Google) or
  • Keep it all in memory (e.g. Cray XMP and Graph).
I also asked what big data is and was told the following:
  • Volume – the sheer amount of data (Capture and Store)
  • Velocity – the speed that data arrives (Sensor data, Market data, and Retail data)
  • Variety – the various types of disparate data (Structured, Unstructured, and Video)
I suggested the need for another V, Value (see the DHS experience with big data in my recent story), to decide whether you want big data in the first place. There was agreement that a Data Scientist/Statistician needs to determine what can be done with the data based on two basic types:
  • Statistical (method, model, predict) or
  • Found data (exploratory data analysis: needles, relationships, etc.).
I asked how agencies and companies are addressing big data problems and was told:
  • Bigger/Specialized hardware (Oracle’s Exadata, IBM Netezza, Cray XMP and Graph)
  • More hardware (parallel processing with Hadoop) and distributed computing (MapReduce, NoSQL architectures); see the sketch of the MapReduce pattern after this list
  • Managing the complexity (Splunk)
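To give a feel for the MapReduce style of distributed computing mentioned above, here is a minimal single-machine sketch of the map/shuffle/reduce pattern in Python. It only illustrates the programming model; a real Hadoop job would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict

# Minimal single-process illustration of the MapReduce pattern:
# map each record to (key, value) pairs, group by key, then reduce each group.

def map_phase(record):
    """Map: emit (word, 1) for every word in a line of text."""
    for word in record.lower().split():
        yield word, 1

def reduce_phase(key, values):
    """Reduce: sum the counts emitted for one word."""
    return key, sum(values)

def mapreduce(records):
    groups = defaultdict(list)
    for record in records:                      # map
        for key, value in map_phase(record):    # shuffle: group values by key
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())  # reduce

if __name__ == "__main__":
    lines = ["big data needs big tools", "big data moves fast"]
    print(mapreduce(lines))
    # {'big': 3, 'data': 2, 'needs': 1, 'tools': 1, 'moves': 1, 'fast': 1}
```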
This conference gave me the idea to make the conference proceedings an example of ISR (Intelligence, Surveillance, and Reconnaissance) big data, with the different types of data that need to be integrated using the Intelligence Community's Catalyst-type tools, following Gall's Law: a simple system with big data that works can be scaled up to a complex system that works.

I did a semantically enhanced index of the web site of Perfect Search (one of the big data vendors at the conference) as an example, so I could understand what they do, integrate their unstructured and structured data, and search their entire set of web pages within the Google Chrome browser. I was very impressed with what they do to make search better for big data! One can also see from their work how search has been significantly improved by implementing simple concepts like “separating the wheat from the chaff” and making them scale to big data. One can see from my simple example how semantic analytics can be used to improve search.
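As a rough illustration of that idea, here is a generic sketch of synonym expansion over a toy inverted index in Python. This is not Perfect Search's actual technology, and the documents and synonym table are made up; it only shows how a little semantics can surface pages that a plain keyword match would miss.

```python
from collections import defaultdict

# Toy inverted index with synonym expansion: a generic sketch of how a small
# amount of semantics can improve plain keyword search. All data is made up.

DOCS = {
    "page1": "perfect search indexes structured and unstructured data",
    "page2": "fast retrieval of big data at scale",
    "page3": "finding the wheat and discarding the chaff",
}

SYNONYMS = {"search": {"retrieval", "finding"}, "data": {"information"}}

def build_index(docs):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def semantic_search(query, index):
    """Expand each query term with its synonyms, then union the matching pages."""
    hits = set()
    for term in query.lower().split():
        for variant in {term} | SYNONYMS.get(term, set()):
            hits |= index.get(variant, set())
    return sorted(hits)

index = build_index(DOCS)
print(semantic_search("search", index))
# ['page1', 'page2', 'page3'] versus just ['page1'] for an exact keyword match
```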

So, one can search the entire big data conference proceedings (text, slides, video transcript, and my story) as a small scale example of what is being done on a massive scale with big data.

Clearly, with big data, volume and value really matter: you need a data scientist/statistician to first tell you whether your data is valuable enough to warrant the technology you will need to deal with its volume in a scalable and affordable way.