What Is Data Science?
Science is everywhere. When children first set out to learn everyday science in school, the statement they heard most often was “Science is everywhere”. The situation is much the same today, except that science has added a feather to its cap: people now say that “Data Science” is everywhere. What does that mean? Let us take a look at the science of data and at what sets it apart from everyday science.
In the age of big data, data itself is the object of study.
Data Science is a platter of data inference, algorithm development, and technology, from which users can find recipes for solving analytically complex problems.
Data is at the core: raw information streams in and is stored in enterprise data warehouses, ready to be put to work on your complex problems. To extract the best from the data generated, Data Science calls upon data mining. At the end of the tunnel, Data Science is about finding different ways to use data and generate value for organizations.
Let us dig deeper into the tunnel and see how various domains make use of Data Science.
Think of a day without Data Science: Google would not generate results the way it does today.
Suppose you manage an eatery that caters to many different tastes. To design a new product, you want to know what your customers require. You already know that they prefer extra cheese on their pizza to jalapeno toppings. That is existing data, along with their browsing history, purchase history, age and income. Now add more variety to this existing data. With the larger amount of data generated, your strategies for meeting customers’ requirements can be more effective. Satisfied customers will recommend your product to others outside their circle, which brings more business to the organization.
Consider this image to understand how an analysis of the customers’ requirements helps:
Data Science plays its role in predictive analytics too.
Suppose my organization builds devices that send an alert when a natural calamity is about to occur. Data from ships, aircraft, and satellites can be accumulated and analyzed to build models that not only help with weather forecasting but also predict the occurrence of natural calamities. A device built on such a model can send alerts and save lives.
Consider the image shown below to understand how predictive analytics works:
Many of us who are active on social media post images of ourselves having fun with friends. You might forget to tag your friends in the images you post, but the tag suggestion feature available on most platforms will remind you.
The automatic tag suggestion feature uses a face recognition algorithm.
Summarizing the main phases of the Data Science Lifecycle will help us understand how the Data Science process works. The various phases in the Data Science Lifecycle are:
Discovery marks the first phase of the lifecycle. When you set sail on a new endeavor, it is important to capture the requirements and priorities. The ideation in this phase should cover all the specifications along with an outline of the required budget. You need an inquisitive mind to assess your resources: whether you have the manpower, technology, infrastructure and, above all, time to support your project. In this phase, you also need to lay out the business problem and build initial hypotheses (IH) to test.
Data preparation is the second phase. Here you use an analytical sandbox in which you can perform analytics for the entire duration of the project. You explore, preprocess and condition the data before modeling. To get the data into the sandbox, you perform ETLT (extract, transform, load and transform).
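The ETLT steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: the CSV content, the `orders` table and the function names are all hypothetical, and an in-memory SQLite database stands in for the analytical sandbox.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a source system.
RAW_CSV = """customer_id,topping,amount
1, cheese ,12.50
2,jalapeno,9.00
3,cheese,11.25
4,,8.00
"""


def extract(text):
    """Extract: read the raw rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def light_transform(rows):
    """First transform: trim whitespace, cast types, drop rows with no topping."""
    cleaned = []
    for row in rows:
        topping = row["topping"].strip()
        if not topping:
            continue
        cleaned.append((int(row["customer_id"]), topping, float(row["amount"])))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into the sandbox table."""
    conn.execute(
        "CREATE TABLE orders (customer_id INTEGER, topping TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)


def transform_in_sandbox(conn):
    """Second transform: aggregate inside the sandbox with SQL."""
    cursor = conn.execute(
        "SELECT topping, COUNT(*), SUM(amount) "
        "FROM orders GROUP BY topping ORDER BY topping"
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")  # stand-in for the analytical sandbox
load(conn, light_transform(extract(RAW_CSV)))
summary = transform_in_sandbox(conn)
print(summary)
```

The second transform runs inside the database, which is the point of the extra "T": once the data is in the sandbox, further shaping can happen where the data lives.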
You can use R for data cleaning, transformation, and visualization, which helps you spot outliers and establish relationships between variables. Once the data has been cleaned and prepared, you can move on to exploratory analytics.
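As a rough illustration of spotting outliers during cleaning, here is a minimal sketch. It uses Python rather than R, and it assumes Tukey's interquartile-range rule as the outlier test, which the text does not prescribe. The order counts are made-up sample data.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical daily order counts with one suspicious entry.
orders = [21, 23, 22, 25, 24, 20, 22, 95, 23, 21]
outliers = iqr_outliers(orders)
cleaned = [v for v in orders if v not in outliers]
print(outliers, cleaned)
```

Whether a flagged value is an error or a genuine extreme is a judgment call; the rule only points at candidates worth a closer look.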
In the model planning phase, you determine the methods and techniques for identifying the relationships between variables. These relationships set the base for the algorithms that will be implemented in the next phase. Exploratory Data Analysis (EDA) is applied in this phase using various statistical formulas and visualization tools.
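One of the simplest statistical formulas used to probe relationships between variables is the Pearson correlation coefficient. The sketch below computes it for every pair of variables in a small, made-up customer data set; the variable names and values are assumptions for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical customer variables (income in thousands).
data = {
    "age": [22, 35, 47, 51, 63],
    "income": [28, 42, 55, 61, 70],
    "orders_per_month": [9, 7, 4, 4, 2],
}

names = list(data)
pairs = {
    (a, b): pearson(data[a], data[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
for (a, b), r in pairs.items():
    print(f"{a} vs {b}: r = {r:+.2f}")
```

A coefficient near +1 or -1 suggests a strong linear relationship worth modeling; a value near 0 suggests there is little linear signal between that pair.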
Next, let us look at some of the tools used for model planning.
R is one of the most commonly used tools. It comes with a complete set of modeling capabilities and provides a good environment for building interpretive models.
SQL Analysis Services can perform in-database analytics using basic predictive models and common data mining functions.
SAS/ACCESS helps you access data from Hadoop. It can be used to create repeatable and reusable model flow diagrams.
You now have an overview of the nature of your data and have zeroed in on the algorithms to use. In the next stage, these algorithms are applied to build a model.
This is the model building phase. Here, you develop datasets for training and testing purposes. You also need to decide whether your existing tools will suffice for running the models or whether you need a more robust environment (such as fast, parallel processing).
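Developing training and testing datasets usually starts with a random split of the prepared records. The following is a minimal sketch, assuming an 80/20 split and a fixed seed for reproducibility; the function name and the sample records are hypothetical.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled records.
records = [{"id": i, "label": i % 2} for i in range(50)]
train, test = train_test_split(records)
print(len(train), len(test))
```

The model is fitted on the training rows only and then evaluated on the held-out test rows, so the evaluation reflects data the model has not seen.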
The various tools for model building are SAS Enterprise Miner, WEKA, SPSS Modeler, MATLAB, Alpine Miner and Statistica.
In the Operationalize phase, you deliver final reports, briefings, code and technical documents. Moreover, a pilot project may also be implemented in a real-time production environment on a small scale. This helps users get a clear picture of the performance and other related constraints before full deployment.
The Communicate results phase is the conclusion. Here, you evaluate whether you have met the goal you set in the initial phase. It is in this phase that the key findings emerge and you communicate them to the stakeholders. This phase determines whether your project is a success or a failure.
Data Science, to be precise, is an amalgamation of infrastructure, software, statistics and various data sources.
To really understand big data, it helps to look back at its historical background. Gartner’s definition, circa 2001, is still the go-to definition:
Big data is data that contains greater variety arriving in increasing volumes and with ever-higher velocity. This is known as the three Vs.
In simple terms, all it means is that big data is humongous. It involves ever larger and more complex data sets, fed by new data sources. At such high volumes, traditional data processing software fails to manage them. It is just like how you cannot expect a humble typewriter to do the job of a computer; a typewriter cannot even do a Ctrl+C, Ctrl+V job for you. The amount of data that holds the solutions to your business problems is massive, and Data Science plays the key role in processing it.
The concept of big data may sound relatively new; however, the origins of large data sets can be traced back to the 1960s and ’70s, when the world of data was just getting started. The world witnessed the setup of the first data centers and the development of the relational database.
Around 2005, Facebook, YouTube, and other online services started gaining immense popularity. The more people used these platforms, the more data they generated. Processing this data involved a lot of Data Science: the amassed data had to be stored and analysed at a later point. Hadoop was developed as an answer to this need. Hadoop is an open-source framework that helps in the storage and analysis of big data sets. NoSQL also gained popularity during this time.
With the advent of big data, the need for storage also grew. Data storage remained a major issue for enterprises until 2010. Hadoop, Spark and other frameworks have since mitigated the challenge to a very large extent. Though the volume of big data is skyrocketing, the focus can now remain on processing the data, thanks to these efficient frameworks. And Data Science once again hogs the limelight.
Can we say that users alone generate these huge amounts of data? No, we cannot. Data is generated not only by people but also by the work they do.
Looking at the Internet of Things (IoT) brings some clarity to the question we just raised. As more objects and devices are connected to the Internet, data is gathered not just on how products are used but also on usage patterns and product performance.
Data Science helps in the extraction of knowledge from the accumulated data. While big data has come far with the accumulation of users’ data, its usefulness is only just beginning.
The following three properties define big data:
Volume is the amount of data, and it is a crucial factor. Big data comes into play when you have to process large quantities of low-density, unstructured data. The data may be of unknown value, such as clickstreams on a webpage or a mobile app, or Twitter data feeds. The volume differs from one organization to another: for some it might be tens of terabytes of data, for others hundreds of petabytes.
Consider the different social media platforms: Facebook records 2 billion users, YouTube 1 billion, Twitter 350 million and Instagram a whopping 700 million. Billions of images, posts and tweets are exchanged on these platforms. Imagine the sheer amount of data these users contribute. Mind-boggling, is it not? This insanely large amount of data is generated every minute and every hour.
Velocity is the rate at which data is received and acted upon. Usually, data is written to disk, but the highest-velocity data streams directly into memory. With the advancement in technology, we now have more Internet-connected devices across industries. The data generated by these devices arrives in real time or near real time and may call for real-time evaluation and action.
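To make the idea of evaluating data as it streams in more concrete, here is a small sketch that keeps only the most recent readings in memory and raises an alert when their average crosses a threshold. The sensor values, window size and threshold are all assumptions chosen for illustration.

```python
from collections import deque


def rolling_alerts(readings, window=3, threshold=30.0):
    """Yield (index, rolling mean) whenever the mean of the last
    `window` readings exceeds `threshold`."""
    recent = deque(maxlen=window)  # older readings fall out automatically
    for index, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window:
            mean = sum(recent) / window
            if mean > threshold:
                yield index, mean


# Hypothetical temperature readings arriving one at a time.
sensor_stream = iter([22.0, 24.0, 23.0, 31.0, 36.0, 38.0, 27.0])
alerts = list(rolling_alerts(sensor_stream))
print(alerts)
```

Because the function consumes one reading at a time and stores only a fixed-size window, it never needs the full history on disk, which is the essence of acting on high-velocity data in memory.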
Sticking to our social media example, Facebook accounts for 900 million photo uploads, Twitter handles 500 million tweets, Google is the go-to solution for 3.5 billion searches, and YouTube receives 0.4 million hours of video uploads, all on a daily basis. The combined amount of data is staggering.
Variety refers to the different types of data that users generate. In the past, we had traditional data types that were structured and organized in a relational database.
Texts, tweets, videos and photos are among the varieties of unstructured and semi-structured data uploaded to the Internet.
Voicemails, emails, ECG readings, audio recordings and a lot more add to the varieties of unstructured data that we find on the Internet.
Deep thinking and deep learning, driven by intense intellectual curiosity, are common traits of data scientists. The more questions you ask, the more discoveries you make, the richer your learning experience becomes and the easier it gets to tread the path of Data Science.
What differentiates a data scientist from an ordinary breadwinner is an obsession with creativity and ingenuity. An ordinary breadwinner goes seeking money, whereas the motivator for a data scientist is the ability to solve analytical problems with a pinch of curiosity and creativity. Data scientists are always on a treasure hunt, hunting for the best from the trove.
If you think you need a degree in the sciences or a PhD in Math to become a legitimate data scientist, you are carrying a misconception. A natural propensity in these areas will definitely add to your profile, but you can be an expert data scientist without a degree in them. What makes Data Science easier is solid knowledge of programming together with business acumen.
Data Science is a discipline that has gained colossal prominence of late, and educational institutions are yet to come up with comprehensive Data Science degree programs. A data scientist can never claim to have completed all the required schooling. Learning the right skills, guided by self-determination, is a never-ending process for a data scientist.
As Data Science is multidisciplinary, many people find it hard to differentiate between a Data Scientist and a Data Analyst.
Data Analytics is one of the components of Data Science. Analytics helps in understanding the data structure of an organization. The output is then used to solve problems and bring in business insights.
Data Scientists and Data Analysts are not exactly synonymous, and the roles are not mutually exclusive either, but they do differ a lot. Let us take a look at some of the basic differences:
Data scientists blend several skills. The fundamental skills required to become a Data Scientist are as follows:
A Data Scientist needs to be equipped with a quantitative lens. You can be a Data Scientist if you have the ability to view the data quantitatively.
Before a data product is finally built, it calls for a tremendous amount of data insight mining. Data has textures, dimensions and correlations. A mathematical perspective always helps in finding solutions that lead to an end product.
If you have a knack for Math, finding solutions from data with heuristics and quantitative techniques becomes a cakewalk. The path to solving major business problems is a tedious one that involves building analytical models. Data Scientists need to identify the underlying nuts and bolts to build those models successfully.
Data Science carries with it the misconception that it is all about statistics. Statistics is crucial, but it is not the only kind of Math that matters. Statistics has two offshoots, the classical and the Bayesian. When people talk about stats, they are usually referring to classical stats, but Data Scientists need to draw on both types to arrive at solutions. Moreover, there is a mix of inferential techniques and machine learning algorithms, and this mix leans on knowledge of linear algebra. Many popular methods in Data Science rely on matrix math, which has very little to do with classical stats.
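The difference between the classical and the Bayesian offshoots can be seen in a small example: estimating a proportion from a handful of observations. This sketch assumes a Beta prior for the Bayesian estimate (uniform by default); the scenario, the numbers and the function names are hypothetical.

```python
def classical_estimate(successes, trials):
    """Classical (maximum-likelihood) estimate of a proportion."""
    return successes / trials


def bayesian_estimate(successes, trials, prior_a=1.0, prior_b=1.0):
    """Posterior mean of a proportion under a Beta(prior_a, prior_b) prior.

    With a Beta prior and binomial data, the posterior is
    Beta(prior_a + successes, prior_b + failures), whose mean is
    (prior_a + successes) / (prior_a + prior_b + trials).
    """
    return (prior_a + successes) / (prior_a + prior_b + trials)


# Hypothetical: 7 of 10 surveyed customers prefer extra cheese.
mle = classical_estimate(7, 10)
posterior_mean = bayesian_estimate(7, 10)
print(f"classical: {mle:.3f}, Bayesian: {posterior_mean:.3f}")
```

The classical estimate uses the observed data alone, while the Bayesian estimate blends the data with the prior, so it is pulled slightly toward 0.5 here. With more observations the two estimates converge.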
On a lighter note, a disclaimer: you are not being asked to learn hacking in order to break into computers. As a hacker in this sense, you need to combine creativity and ingenuity. You are expected to use the right technical skills to build models and thereby find solutions to complex analytical problems.
Why does the world of Data Science vouch for your hacking ability? The answer lies in the way Data Scientists use technology. Mindset, training and the right technology, when put together, can squeeze the best out of mammoth data sets. Solving complex problems requires more sophisticated tools than just Excel. Data scientists need the nitty-gritty ability to code. They should be able to prototype quick solutions as well as integrate with complex data systems. SQL, Python, R, and SAS are the core languages associated with Data Science. Know-how of Java, Scala, Julia, and other languages also helps. However, knowledge of language fundamentals alone does not suffice in the quest to extract the best from enormous data sets. A hacker needs to be creative to sail through technical waters and bring the code to shore.
Strong business acumen is a must-have in the portfolio of any Data Scientist. You need to make tactical moves and draw from the data what no one else can. Translating your observations into shared knowledge is a big responsibility and leaves no room for error.
With the right business acumen, a Data Scientist finds it easy to present a story or the narration of a problem or a solution.
To be able to put your ideas and the solutions you arrive at across the table, you need to have business acumen along with the prowess for tech and algorithms.
Data, Math, and tech will not always be enough. You also need strong business influence, which in turn rests on strong business acumen.
To address the issues associated with managing complex and expanding work environments, IT organizations use data to identify new sources of value. This helps them exploit future opportunities and further expand their operations. What makes the difference here is the knowledge you extract from the repository of data. The biggest and best companies use analytics to come up with the best business models efficiently.
Following are a few top companies that use Data Science to expand their services and increase their productivity.
Google is always on a hiring spree for top-notch data scientists. Google is driven largely by a force of data scientists, artificial intelligence and machine learning. Moreover, when you are here, you get the best when you give the best of your data expertise.
Amazon, the global e-commerce and cloud computing giant, hires data scientists on a big scale. It uses Data Science to understand customers’ mindsets and to enhance the geographical outreach of both its cloud and e-commerce domains, among other business-driven goals. Data Scientists play a crucial role in steering these efforts.
It has answers to a range of business problems – from customer experience to analytics.
Netflix and Procter & Gamble use big data in product development to anticipate customer demand. They make use of predictive analytics, an offshoot of Data Science, to build models for the products and services in their pipeline. This modelling contributes to their commercial success. P&G in particular uses data and analytics from test markets, social media, and early store rollouts; following this strategy, it plans, produces, and launches the final products, which often garner an overwhelming response.
When processing speed was combined with storage capability, the final component of the Big Data story evolved: the generation and collection of data. If we still had massive room-sized calculators working as computers, we might not have come across the humongous amount of data that we see today. With the advancement in technology came ubiquitous devices, and with the increase in the number of devices, more data is being generated. We generate data at our own pace, from our own space, using devices from our comfort zones. Here I tweet, there you post, while someone in a corner of the room you are seated in uploads a video to some platform.
The more you tell people about what you are doing in your life, the more data you end up writing. When I am happy and share a quote on Facebook expressing my feelings, I contribute more data. This is how enormous amounts of data are generated. The Internet-connected devices that we use help write this data. Anything you engage with in this digital world, from the websites you browse to the apps you open on your cell phone, can be logged in a database miles away from you.
Writing data and storing it is not an arduous task anymore. At times, companies just push the value of the data to the back burner. At some point, this data will be fetched and cooked when they see the need for it.
There are different ways to cash in on the billions of data points. Data Science puts the data into categories to get a clear picture.
If you are an organization looking to expand your horizons, being data-driven will take you miles. Applying an amalgam of infrastructure, software, statistics and various data sources is the secret formula for arriving at key business solutions. The future belongs to Data Science. Today, it is data that we see all around us, and this new age sounds the bugle for more opportunities in the field. Very soon, the world will need around one million Data Scientists.
If you are keen on donning the hat of a Data Scientist, be your own architect when it comes to solving analytical problems. You need to be a highly motivated problem solver to overcome the toughest analytical challenges.