What is Data?

In our previous posts, we have talked about data science and data-driven decision making. However, we did not address the basic question: what is data? The answer may seem obvious to people who deal with data on a daily basis, who are data scientists themselves, or who make their day-to-day decisions based on data. But for many others, it is still a relevant question, and for a better understanding and appreciation of data science as a discipline, it should be addressed in right earnest.

Let’s start with the basics: the dictionary definition of data. According to the Oxford English Dictionary, data is the plural of datum, which itself means ‘a piece of information’. As a mass noun, though, data is treated as singular and is defined as ‘facts and statistics collected together for reference or analysis’. In one sense, data is ‘the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media’. In another sense, data is ‘things known or assumed as facts, making the basis of reasoning or calculation’. Its synonyms include facts, figures, statistics, details, particulars, specifics, and features.

In the Analytics space, Thomas Davenport, in the book ‘Information Ecology’, defines data as “observations of states of the world”, which are easily structured, easily captured on machines, often quantified, and easily transferred. He also mentions that “the observing of such raw facts or quantifiable entities can be done by people or by the appropriate technology”.

The Oxford English Dictionary’s definition of data as ‘things known or assumed as facts’ may be useful in philosophy as a basis for reasoning. But this definition is problematic when applied in the context of data science, decision science, or Analytics projects, because it implies that all pieces of information are data, whether they are known facts or merely assumed to be facts. The problem arises because all the tools and techniques of data science and Analytics hinge on the availability of data that are actually facts, not just assumed to be facts.

To make the above point clear, let us review the perils of using assumed facts in a recent high-visibility instance, when IBM Watson reportedly recommended cancer treatments that were ‘unsafe and incorrect’. The news article states that “Internal company documents from IBM show that medical experts working with the company’s Watson supercomputer found ‘multiple examples of unsafe and incorrect treatment recommendations’ when using the software”. It also says that “instead of feeding real patient data into the software—the doctors were reportedly feeding Watson hypothetical patients data, or ‘synthetic’ case data”. It further mentions that “This would mean it’s possible that when other hospitals used the MSK-trained Watson for Oncology, doctors were receiving treatment recommendations guided by MSK doctors’ treatment preferences”. Here, MSK stands for Memorial Sloan Kettering Cancer Center.

As per Wikipedia, IBM “Watson is a question-answering computer system capable of answering questions posed in natural language”. The Wikipedia article also mentions that “in healthcare, Watson’s natural language, hypothesis generation, and evidence-based learning capabilities are being investigated to see how Watson may contribute to clinical decision support systems and the increase in Artificial intelligence in healthcare for use by medical professionals”.

The above reference is not meant to cast any aspersions on IBM Watson, which is a brilliant AI (Artificial Intelligence) system. It is used here only to highlight the perils of using improper data. In the article ‘Predictably inaccurate: The prevalence and perils of bad big data’, Deloitte also “highlights the potential prevalence and types of inaccurate data from US-based data brokers”. This is one reason why many data science and Analytics projects fail to deliver on their promises: they are built on wrong premises.
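To make the idea of “improper data” a little more concrete, here is a minimal sketch of basic data-quality checks in Python using pandas. The dataset, column names, and values below are purely hypothetical illustrations (not data from IBM Watson, MSK, or the Deloitte study); the point is only to show how simple completeness and validity checks can surface records that are assumed, rather than known, to be facts before they ever reach a model.

```python
# A minimal, illustrative sketch of basic data-quality checks.
# The DataFrame, column names, and values are hypothetical examples.
import pandas as pd

records = pd.DataFrame({
    "age": [34, 52, -1, 47, None],                  # -1 and None are suspect entries
    "diagnosis_code": ["C50", "C34", "C50", "", "C18"],
})

# Completeness: count missing values per column.
missing_counts = records.isna().sum()

# Validity: flag ages outside a plausible human range and empty diagnosis codes.
implausible_age = records[(records["age"] < 0) | (records["age"] > 120)]
empty_diagnosis = records[records["diagnosis_code"].str.strip() == ""]

print("Missing values per column:")
print(missing_counts)
print("\nRecords with implausible age:")
print(implausible_age)
print("\nRecords with empty diagnosis codes:")
print(empty_diagnosis)
```

Checks like these are deliberately simple, but they illustrate the kind of scrutiny that data should undergo before any analysis or model is built on top of it.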

So, it appears that issues around the accuracy of data are quite prevalent, and people are aware of these issues and are working towards potential solutions. Considering such issues concerning data and their impact on Analytics projects, it makes sense to engage experts to conduct objective audits of Analytics projects, also known as an Analytics Audit.