Jean-François Marcotorchino, Vice-President, Scientific Director, Secure Communications and Information Systems, Thales

What is Big data?

No official definition of the term exists. The influential McKinsey Global Institute defines it as “datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyse”. But the size of these datasets isn’t specified, since it constantly evolves and varies from one sector to another.

Isn’t it just the latest buzzword?

Far from it. Database storage has scaled up, with the giants of the Internet driving the process, to handle the ever-increasing volumes of data that need to be stored. This has brought a paradigm shift in storage technologies, known as NoSQL (Not only SQL), a term that encompasses the new modes of storage designed to achieve a step change in performance over current SQL relational databases. SQL databases can’t support the data streaming that is now the norm in online commerce.
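To make the contrast concrete, here is a minimal Python sketch of the two access patterns: the schema-first relational model behind SQL, and the schema-free key-value/document model typical of NoSQL stores. The sqlite3 module and a plain dictionary are only stand-ins, and the example data is invented; real NoSQL systems such as MongoDB or Cassandra distribute the key-value idea across many machines.

```python
import sqlite3
import json

# Relational (SQL) pattern: a fixed schema declared up front,
# queried with filters and joins.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Alice', 'Paris')")
print(conn.execute("SELECT name, city FROM users WHERE id = 1").fetchone())

# NoSQL key-value/document pattern: schema-free records fetched by key.
# A plain dict stands in for a distributed store here.
doc_store = {}
doc_store["user:1"] = json.dumps({"name": "Alice", "city": "Paris", "tags": ["vip"]})
print(json.loads(doc_store["user:1"])["city"])
```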

But some sectors, such as banking, insurance and online reservations, require a level of consistency in search results that NoSQL cannot provide across all architectures and implementations. So the leading database players like Oracle have rethought SQL, which still has a lot of fans, and have fairly recently devised NewSQL architectures as an extension of NoSQL. NewSQL supports high-speed transaction processing through an SQL interface, and its proponents claim it can achieve speeds up to 1,000 times faster than native SQL. NewSQL is positioning itself as a direct competitor, or possibly a complement, to NoSQL in terms of scalability. But it remains to be seen how well it actually performs on very large-scale web graphs such as those of social media.
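What NewSQL preserves is this familiar transactional SQL interface. The sketch below shows the kind of atomic transaction that interface expresses, again using sqlite3 purely as a stand-in: actual NewSQL engines (VoltDB or Google Spanner, for example) run the same style of transaction across a horizontally scaled cluster, which is where the claimed performance gains come from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])

# One atomic transaction: both updates commit together or roll back
# together, the consistency guarantee banking and reservations demand.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")

print(conn.execute("SELECT id, balance FROM accounts").fetchall())
# [(1, 70), (2, 80)]
```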

Alongside this quite revolutionary shift in modes of storage, a number of analytical methodologies have also been adapted to scale up to new algorithmic challenges. These technologies, increasingly known as Big analytics, are an extension of what used to be called data mining. Very different algorithms are needed depending on whether the structure of the population to be analysed is known, partially known or unknown. When the structure is known or partially known, we may use statistical sampling processes that do not require exhaustive analysis of the dataset; we could call this "Big analytics by extension" or a "hypothesis-driven" approach. When it is unknown, we're in a "data-driven" mode, meaning that the analysis is almost exhaustive and the algorithms need to be parallelised or linearised to achieve the required level of scalability.
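A toy Python sketch may help to illustrate the distinction. The population, sample size and chunk size below are all invented for the example; the point is only that the hypothesis-driven approach touches a small sample, while the data-driven approach makes an exhaustive pass that has to be parallelised to scale.

```python
import random
import statistics
from concurrent.futures import ProcessPoolExecutor

def chunk_stats(chunk):
    # Partial result for one chunk; partials combine associatively,
    # which is what makes the exhaustive pass parallelisable.
    return (sum(chunk), len(chunk))

if __name__ == "__main__":
    population = [random.gauss(50, 10) for _ in range(1_000_000)]

    # Hypothesis-driven ("Big analytics by extension"): estimate a
    # statistic from a sample, with no exhaustive pass over the data.
    sample = random.sample(population, 10_000)
    print("sampled estimate:", statistics.mean(sample))

    # Data-driven: an almost exhaustive pass, split into chunks and
    # processed in parallel so it scales with the dataset.
    chunks = [population[i:i + 250_000] for i in range(0, len(population), 250_000)]
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(chunk_stats, chunks))
    total, count = (sum(t) for t in zip(*partials))
    print("exhaustive mean:", total / count)
```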

How does Big data differ from existing approaches?

Before Big data came along, vast volumes of data were already being handled on extremely powerful computers in dedicated datacentres. But the term high-performance computing (HPC) doesn’t exactly cover what we mean by Big data because of the kinds of problems that need to be solved.

In these HPC datacentres, we’re talking about extremely specialised data and recurrent queries, which are resolved by multi-disciplinary teams using competing or complementary technologies.

Moreover, many of the areas addressed by HPC are already well known to the scientific community. We’re generally trying to solve problems, or at least get a handle on problems, that have already been identified by dedicated and highly qualified research teams. Big data is more about solving problems that are far less technical, using data that is far less specialised, for people who are far less expert.

Why has Big data become so important today? Is it a disruptive technology?

Yes, it’s a disruptive technology. The combination of clever new storage methods and new algorithms — preferably scalable ones — has brought a step change in data processing capabilities. In addition, vast quantities of data have already been stored, in ways that are now usable, so the potential for value creation is enormous.

What roles have social media played in the emergence of Big data?

Social networks like Facebook, Twitter and LinkedIn, and of course Google, have driven the emergence of the NoSQL standard for databases that provide access to the kind of reticular (network-structured) data that social media rely upon. The social web is a very large graph with billions of nodes, and social networks are undoubtedly among the largest consumers of stored data today. In fact, some NoSQL technologies were pioneered by Google, Facebook, Twitter and Amazon in the first place.
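A hedged sketch of why graph-shaped data suits key-value storage: each node's adjacency list lives under its own key, so traversals become key lookups rather than the repeated self-joins a relational schema would need. The graph and names below are invented, and a plain Python dict stands in for a store that would in reality be sharded across thousands of machines.

```python
# A toy social graph stored as adjacency lists keyed by user id.
follows = {
    "alice": {"bob", "carol"},
    "bob": {"carol", "dave"},
    "carol": {"alice"},
    "dave": set(),
}

def friends_of_friends(graph, user):
    """Two-hop neighbourhood: one key lookup per traversed node."""
    direct = graph.get(user, set())
    two_hop = set()
    for friend in direct:
        two_hop |= graph.get(friend, set())
    return two_hop - direct - {user}

print(friends_of_friends(follows, "alice"))  # {'dave'}
```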

What about cloud computing? How is this related to Big data?

The two are related, but they're not the same thing. Certain cloud computing applications are major users of Big data technologies and supply large amounts of data to Big data storage environments. But Big data applications can also be developed without systematically using the cloud. Sharing data and procedures is where the cloud comes into its own, and with the right security architectures in place, cloud computing is going to drive a huge amount of data traffic, especially transfers of open data. This in turn will call for smart storage solutions and indexing technologies to provide the extremely short latencies needed to make systems usable by many users at the same time. This is a key area of research and development for Thales.
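As one illustration of the kind of indexing that keeps latency short, here is a minimal inverted index in Python. It is a sketch only, with invented documents, and is not a description of Thales's actual solutions: the idea is simply that extra work at ingest time turns each query into a cheap set intersection instead of a scan of every document.

```python
from collections import defaultdict

documents = {
    "doc1": "open data transfers in the cloud",
    "doc2": "secure cloud storage architectures",
}

# Build the inverted index once, at ingest time.
index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.split():
        index[token].add(doc_id)

def search(*terms):
    # Each query is a set intersection over precomputed postings,
    # so latency stays flat as the corpus grows.
    results = [index.get(t, set()) for t in terms]
    return set.intersection(*results) if results else set()

print(search("cloud"))          # {'doc1', 'doc2'}
print(search("cloud", "open"))  # {'doc1'}
```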