The power of big data, like cloud computing and mobility, has emerged as a transformational technology force, but one that poses a host of planning questions for senior government agency officials. Peter Mell, a senior computer scientist for the National Institute of Standards and Technology, devoted many months to assessing the potential and the pitfalls of big data for NIST. He recently shared what he learned and what executives need to understand about big data in an interview with AOL Government’s Wyatt Kash.
Mell outlined some of the misunderstandings and tradeoffs associated with large scale data sets agencies are likely to encounter as they move beyond classic relational databases. He also talked about the importance cloud computing plays in facilitating big data analytics. And he shared with our readers a comprehensive slide presentation that puts many of the questions about big data and related security implications into perspective.
AOL Government: What do agency executives need to understand about the growth of big data?
Peter Mell: Many people think that big data is just a lot of data. But there are some real technological challenges to overcome with the big data paradigm. In particular we are hitting the limits of our relational databases in many areas. And relational databases often don’t work well with unstructured data like English prose or even semi-structured data like an XML document where you don’t know what tag is coming next.
Also we are struggling in processing the volume and velocity of data. And the reason has to do with fundamental computer science limitations and in particular, fault tolerance among the different computers.
So what should agency executives keep in mind in terms of resource planning for big data?
Mell: First is not to get mesmerized by the marketing hype. Second, ask yourself, does my relational database give me the performance that I need with the data I have? If the answer is yes, even if it’s a little slow, then you have no business looking at big data. The reason I say that so emphatically is that big data is not the next generation of database technology. There are tradeoffs to be made.
When you move into the big data world, you are often giving up very powerful query mechanisms like SQL. You are also giving up perfect data consistency in transactions. What you are gaining is scalability and the ability to process unstructured data.
You mention tradeoffs. What kind of tradeoffs and what do executives need to consider?
Mell: One tradeoff involves consistency. It’s not so much that your data is inconsistent. It’s that it will be eventually consistent. Think of the banking system. I am on a trip to Europe, my wife deposits $100, five seconds later I check an ATM in Europe, and the $100 isn’t there. Oh no, what happened? Well, it’s okay. My $100 will eventually show up. It will be eventually consistent. I would argue if the banking industry can use big data technology, so can you if you need it. Ideally, big data will be consistent all the time.
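The ATM story can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not how any bank or big data product actually works: the class names `Replica` and `EventuallyConsistentAccount`, the two sites "us" and "europe", and the explicit `sync()` step are all hypothetical stand-ins for background replication.

```python
class Replica:
    """One copy of the account balance (hypothetical toy model)."""

    def __init__(self, name, balance=0):
        self.name = name
        self.balance = balance


class EventuallyConsistentAccount:
    """An account whose balance is copied to several sites with a delay."""

    def __init__(self, replica_names):
        self.replicas = {name: Replica(name) for name in replica_names}
        self.pending = []  # (target replica name, amount) not yet delivered

    def deposit(self, at, amount):
        # Apply the deposit locally right away, then queue it for the others.
        self.replicas[at].balance += amount
        for name in self.replicas:
            if name != at:
                self.pending.append((name, amount))

    def read(self, at):
        # A read only sees what has reached this particular replica.
        return self.replicas[at].balance

    def sync(self):
        # Deliver every queued update; afterwards all replicas agree.
        for name, amount in self.pending:
            self.replicas[name].balance += amount
        self.pending.clear()


account = EventuallyConsistentAccount(["us", "europe"])
account.deposit("us", 100)
stale = account.read("europe")   # read before the update has propagated
account.sync()
fresh = account.read("europe")   # read after the update has propagated
print(stale, fresh)              # 0 100
```

The first read misses the deposit and the second one sees it, which is the "eventually" in eventually consistent.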
In the big data world, there is now a very famous theorem called the CAP theorem: consistency, availability and partition-tolerance. I hesitate to ever mention theorems to people, but it really helps people understand the tradeoffs.
The idea is this: imagine a Venn diagram with three circles, consistency, availability and partition-tolerance. Partition-tolerance just means that things in your distributed system can fail and the system still keeps working. The theorem proves that you can’t have all three perfectly. If things have to be allowed to fail and the system still keeps working, that means that you cannot have perfect data consistency and availability at the same time.
In the database world, they can give you perfect consistency, but that limits your availability or scalability. It’s interesting, you are actually allowed to relax the consistency just a little bit, not a lot, to achieve greater scalability.
Well, the big data vendors took this to a whole new extreme. They just went to the other side of the Venn diagram and they said we are going to offer amazing availability or scalability, knowing that the data is going to be consistent eventually, usually. That was great for many things.
And then people realized, let’s not be extremist; let’s not be too far one way or the other; maybe we need a balance. So we started developing systems with knobs and dials that you can turn to decide how much data consistency you want and how much availability you want. You can tune that dynamically and make the tradeoff per application, per your need. So some of the big data technologies allow you to explicitly tune it how you want it.
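One common way such dials show up, assumed here for illustration and not named in the interview, is a quorum setting: with N copies of the data, a write waits for W copies and a read consults R copies. When R + W > N the two sets must overlap, so a read always sees the latest write; smaller values answer faster but may return stale data. The `QuorumStore` class below is a hypothetical toy that models the worst case, where the reader happens to contact the replicas least likely to have the write.

```python
class QuorumStore:
    """Toy store with n replicas; w and r are the tunable 'dials'."""

    def __init__(self, n, w, r):
        self.n, self.w, self.r = n, w, r
        self.replicas = [(0, None)] * n   # (version, value) per replica
        self.version = 0

    def write(self, value):
        # Only the first w replicas acknowledge; the rest stay stale.
        # (A real system would update the others in the background.)
        self.version += 1
        for i in range(self.w):
            self.replicas[i] = (self.version, value)

    def read(self):
        # Worst case: contact the last r replicas and keep the newest version.
        contacted = self.replicas[self.n - self.r:]
        return max(contacted, key=lambda pair: pair[0])[1]


strong = QuorumStore(n=3, w=2, r=2)   # r + w > n: read and write sets overlap
strong.write("new balance")
strong_result = strong.read()

fast = QuorumStore(n=3, w=1, r=1)     # r + w <= n: a read can miss the write
fast.write("new balance")
fast_result = fast.read()

print(strong_result)   # new balance
print(fast_result)     # None
```

Turning W and R down buys speed and availability at the price of occasionally stale reads; turning them up does the reverse.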
Some big data technologies will claim to give you perfect data consistency and scalability. And for such solutions you should start asking yourself, are they really resistant to component failure, because that’s the third circle in that CAP theorem.
For most of our high value data, though, we are going to still keep it in relational databases.
But when those don’t effectively work with the structure of the data that we are processing, or we have too much data, or we have a variety of data, then in order to process it we are going to have to make some tradeoffs. We are going to have to be willing to relax some of the data consistency to get that scalability.
Where might you be willing to relax consistency in order to achieve scalability?
Mell: An example of where I don’t really need data consistency (in real time) is tracking Twitter followers. I really don’t care if I have 200 followers or 210 followers. It will be nice to know a week from now how many I really had, but right now I really don’t care. From Twitter’s standpoint, trying to give a precise answer for everybody, with millions of customers, may not be possible (in real time), because providing a perfectly consistent data feed would likely mean a loss in (real time) availability.
An example where data consistency might matter more goes back to the financial systems we talked about earlier, where money is moving and we are doing lots of quick transactions. So when you want to know how much money somebody has to pay for something, and you are going to take that money right now, you need to know they really have the money. So consistency matters a great deal. In truth, I don’t know in the real world how consistent the data really is or if it’s just an illusion that we have all grown accustomed to.
What else about big data do executives need to prepare for?
Mell: People are beginning to understand for the first time that big data isn’t just about a lot of data; that big data is enabling you to scale dramatically your analysis with a certain cost, with a limited query capability, with data that isn’t immediately consistent.
I think people are also understanding that there is a tradeoff between relational databases and big data technology. They are understanding why they would make that tradeoff – that big data technology isn’t just one thing; there are different choices they have to make.
For example there are different approaches to tackling big data: There are graph representation databases, key-value store databases, document store databases, and column-oriented databases among other approaches. Executives or program managers need to understand that they are going to really have to do their homework before they move forward in implementing one of these.
Certainly given that this is all fairly new, I would expect people to do lots of small pilots, try this out, see what they can do and grow incrementally. I wouldn’t expect anybody to go out and immediately do some large deployment.
That said, there are lots of cloud services now offering big data technology. So you don’t have to have people that know how to set this thing up or administer it. You can just go out to the cloud and use it.
I know there are all sorts of issues with data sensitivity, security and so on. You have to take all that into account of course, but for prototyping something, for trying it out, those clouds are certainly available. That leads me to point out that big data and cloud sort of go hand in hand.
Big data builds on top of cloud computing because big data is all about distributed computing, with large numbers of computers working together. And one of the most efficient ways to administer large numbers of computers working together is a cloud.
As a government, we got all excited about consolidating our data centers with cloud computing. Then we realized that we needed cloud architectures. And now we are realizing that on top of these cloud architectures, we can build big data solutions to give us database capabilities that we don’t have today.
How do you see the market evolving with the need to manage the explosion of so much unstructured data?
Mell: Some people claim that 95% of the data that we produce in the world is unstructured data. How in the world do we analyze it? Big data technology offers rapid ways to parallelize the analysis of unstructured data, to look for needles in a haystack.
Typically, the big data approach that comes up is MapReduce, which is a limited functionality, parallel computation system or algorithm. But the point I want to make strongly is that with big data technology, you are not talking about one thing; it is really all of the approaches that are non-relational.
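To make the MapReduce idea concrete, here is a minimal single-machine sketch of its three steps (map, group by key, reduce) counting words in text. It is an illustration only: the function names and sample documents are made up, and a real framework would run the map and reduce steps on many machines and handle failures.

```python
from collections import defaultdict


def map_phase(document):
    # Emit a (word, 1) pair for every word in one document.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    # Group all emitted values by key so each key is reduced in one place.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Collapse all values for one key into a single result.
    return key, sum(values)


documents = ["big data is big", "data is data"]   # hypothetical input
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)   # {'big': 2, 'data': 3, 'is': 2}
```

Each document is mapped without looking at any other document, which is what lets the work be spread across many computers; the "limited functionality" is that everything has to be phrased as these two steps.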
Would you briefly describe those different approaches?
Mell: For example, there are graph databases, where you represent your data using dots with lines between them. You label the dots and you label the lines to show relationships. This is very powerful when you want to model relationships among the population. The intelligence community makes great use of graph databases.
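The dots-and-lines picture can be sketched with plain Python data structures. This is not a real graph database, just an assumed minimal model: the people, the company and the relationship labels are invented, and the helpers `neighbors` and `employers_of_contacts` are hypothetical.

```python
# Dots (nodes) with labels; every name here is made up for illustration.
nodes = {"alice": "person", "bob": "person", "acme": "company"}

# Lines (edges) with labels: (source, relationship, destination).
edges = [
    ("alice", "knows", "bob"),
    ("bob", "works_for", "acme"),
]


def neighbors(node, label):
    """Follow every edge with the given label out of a node."""
    return [dst for src, lbl, dst in edges if src == node and lbl == label]


def employers_of_contacts(person):
    """Two-hop query: where do the people this person knows work?"""
    return sorted({company
                   for contact in neighbors(person, "knows")
                   for company in neighbors(contact, "works_for")})


print(neighbors("alice", "knows"))       # ['bob']
print(employers_of_contacts("alice"))    # ['acme']
```

Questions that hop from relationship to relationship, like the second query, are the kind that get awkward as chains of joins in a relational database.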
There are key-value stores, where you have some data you are going to store and you put a tag on it like a name. So I might store data with the name John Doe and the value might be he wears glasses. There is no schema, no table, no query language. Because it’s so simple it’s really, really fast and it can scale dramatically.
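A key-value store is close to a dictionary spread over many machines. The sketch below is a toy under assumptions: the `KeyValueStore` class is hypothetical, the "machines" are just in-memory dictionaries, and hashing the key with CRC32 to pick a shard is one simple partitioning choice among many.

```python
import zlib


class KeyValueStore:
    """Toy key-value store: no schema, no tables, no query language."""

    def __init__(self, num_shards=4):
        # Each dict stands in for one machine holding part of the data.
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash of the key decides which shard owns it.
        index = zlib.crc32(key.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)


store = KeyValueStore()
store.put("John Doe", "wears glasses")
print(store.get("John Doe"))   # wears glasses
```

Because each key lives on exactly one shard and lookups never touch the others, adding shards adds capacity, which is where the speed and scalability come from. The cost is that the only question you can ask is "what is stored under this key?"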
A regular database or relational database stores its data sequentially by record. So if we have a record that holds a person’s name, age, hair color and eye color, it stores Peter Mell, 40 years old, brown hair, brown eyes in that order. If you want to then look up Peter Mell, you get my hair color quite readily. But if you want to do statistics on everybody’s hair color, it’s not very efficient because you have all this other data interspersed between the places where the hair color is stored on disk.
The column-oriented database stores all the values of a particular attribute together, so all the hair colors together and all the eye colors together, which makes it very efficient for certain kinds of analysis, certain kinds of queries.
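The difference between the two layouts can be shown with in-memory lists. This is a sketch of the layout idea only, not of how any database arranges bytes on disk; the first record comes from the interview and the other two are hypothetical.

```python
from collections import Counter

FIELDS = ("name", "age", "hair", "eyes")

# Row-oriented: each record is stored together, one after another.
rows = [
    ("Peter Mell", 40, "brown", "brown"),   # from the interview
    ("Jane Roe", 35, "black", "green"),     # hypothetical
    ("John Doe", 52, "brown", "blue"),      # hypothetical
]

# Column-oriented: all values of one attribute are stored together.
columns = {field: [row[i] for row in rows] for i, field in enumerate(FIELDS)}

# Looking up one person is natural in the row layout: one record, all fields.
peter = next(row for row in rows if row[0] == "Peter Mell")

# Statistics on one attribute are natural in the column layout:
# only the hair colors are read, nothing else is touched.
hair_counts = Counter(columns["hair"])

print(peter[2])                  # brown
print(hair_counts["brown"])      # 2
```

The same data supports both questions either way; the layout decides which one is cheap.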
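I should mention that relational databases in some cases can solve big data problems, namely when the problem is what’s called embarrassingly parallel, meaning the work splits into pieces that don’t depend on each other. Then you can use a relational database and scale it almost linearly.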
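As a rough sketch of what embarrassingly parallel means, the example below counts one hair color across three independent shards of data and then adds up the partial answers. The shards, the `count_brown` helper and the use of a thread pool are all assumptions for illustration; in practice each shard would be a separate database server, and the threads here show the structure rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def count_brown(shard):
    # Each shard is processed with no knowledge of the other shards.
    return sum(1 for hair in shard if hair == "brown")


# Hypothetical data, split across three independent shards.
shards = [
    ["brown", "black"],
    ["brown", "brown"],
    ["red"],
]

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(count_brown, shards))

total = sum(partials)   # combining the partial answers is trivial
print(partials, total)  # [1, 2, 0] 3
```

Because no shard ever waits on another, doubling the number of shards roughly halves the work per shard, which is the near-linear scaling described above.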
How do you see big data analytics products evolving?
Mell: Database vendors are starting to offer big data technology within their database products. So it’s going to be really confusing for the acquisition people, because there is not going to be clear demarcation between a relational database product and a big data product. They are all kind of going to mix together.
And people will purchase database products with different feature sets, depending on the product. On the big data side, big data vendors are adding in SQL sometimes to offer a more powerful query language. But that’s dangerous, because if you run an even remotely complicated query it’s going to be horribly inefficient.
Many of the big data solutions are being optimized for writing. They are assuming that you are going to write once and you are never going to change it. They are also assuming you are going to write in big chunks. So if you make those assumptions, you can make a very efficient database that allows you to write enormous amounts of information. Of course, changing it won’t be efficient, but the assumption is you are not going to change it.
Then you can use algorithms like MapReduce or other streaming algorithms for analyzing data on the fly, that allow you to process that data and do whatever analysis you need to do.
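The write-once, big-chunk assumption and the streaming analysis that goes with it can be sketched as an append-only log. Everything here is a hypothetical toy: the `AppendOnlyLog` class, its method names and the sample numbers are invented, and a real system would write chunks to disk across many machines.

```python
class AppendOnlyLog:
    """Toy write-once store: chunks are appended and never modified."""

    def __init__(self):
        self._chunks = []

    def append_chunk(self, records):
        # Freeze the chunk as a tuple; there is deliberately no update method.
        self._chunks.append(tuple(records))

    def scan(self):
        # Stream records back in the order written, one at a time.
        for chunk in self._chunks:
            yield from chunk


def running_average(stream):
    # A streaming computation: one pass, constant memory, no random access.
    total = 0
    count = 0
    for value in stream:
        total += value
        count += 1
    return total / count if count else 0.0


log = AppendOnlyLog()
log.append_chunk([10, 20, 30])   # hypothetical readings, written in bulk
log.append_chunk([40])
average = running_average(log.scan())
print(average)   # 25.0
```

Giving up in-place changes is what keeps writes cheap: new data only ever goes on the end, and analysis reads it front to back.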
One of the concerns beginning to emerge is the lack of analysts who have the expertise to handle big data. How do you see it?
Mell: In the big data world there are really two types of people. There are what I call big data scientists. They are the truly smart people, mathematicians and statisticians who know how to analyze unstructured data and have fancy algorithms for it, and I wish I was one of those people.
Then there are the computer science people that really want to solve computer science challenges: How do I store this data? How do I retrieve this data?
We have primarily talked about the computer science problems, the infrastructure problems. But there is a whole world now where we have this enormous amount of data and the challenge of how do we find the needle in the haystack.
The answer I think is just technological maturity.
There is a growing desire for data scientists. People currently in graduate school or college who want good job prospects may want to focus on becoming data scientists, with the ability to understand computer science algorithms for retrieving and analyzing data, and also the math and statistics of data analytics needed to provide this kind of capability.
Evidently, we have very few people in the workforce competent to really do that. It’s an emerging discipline that’s greatly needed, so it’s an exciting field.