Data Science Tutorial for Beginners

Plainly stated, data science involves extracting knowledge from data that you gather using different methodologies. As a data scientist, you take a complex business problem, research it, turn that research into data, then use the data to solve the problem.

Pretty nifty, huh?

All you need is a clear, deep understanding of a business’ domain and a lot of creativity – which, undoubtedly, you have. A significant area of interest in data science concerns fraud, especially internet fraud. Here, data scientists use their skills to create algorithms that detect and prevent fraud.

You’ll learn about data science from scratch in this article, including career fields for data scientists, real-world data science applications and how to get started in data science. Let us begin this data science tutorial for beginners by understanding the responsibilities of a data scientist.

What Does a Data Scientist Do?

Data scientists work in a variety of fields. Each is crucial to finding solutions to problems and requires specific knowledge. These fields include data acquisition, preparation, mining and modeling, and model maintenance. Data scientists take raw data and, with the help of machine learning algorithms, turn it into a goldmine of information that answers the questions businesses bring to them. Each field can be defined as follows:

  • Data Acquisition: Here, data scientists take data from all its raw sources, such as databases and flat files. Then, they integrate and transform it into a homogeneous format and collect it in what is known as a “data warehouse,” a system from which information can be extracted easily. Also known as ETL (extract, transform, load), this step can be done with tools such as Talend Studio, DataStage and Informatica.
  • Data Preparation: This is the most important stage, wherein 60 percent of a data scientist’s time is spent, because data is often “dirty” or unfit for use and must be made scalable, productive and meaningful. Five sub-steps exist here:
    1. Data Cleaning: Important because bad data can lead to bad models, this step handles missing values and null or void values that might cause the models to fail. Ultimately, it improves business decisions and productivity.
    2. Data Transformation: Takes raw data and turns it into desired outputs by normalizing it. This step can use, for example, min-max normalization or z-score normalization.
    3. Handling Outliers: Outliers are data points that fall far outside the range of the rest of the data. Using exploratory analysis, a data scientist uses plots and graphs to see why the outliers are there and decide what to do with them. Often, outliers are used for fraud detection.
    4. Data Integration: This compiles multiple sources of data into one. Here, the data scientist ensures the combined data is accurate and reliable.
    5. Data Reduction: This eliminates duplicate, redundant data, which frees up storage capacity and reduces costs.
  • Data Mining: Here, data scientists uncover the patterns and relationships in the data to make better business decisions. It’s a discovery process for finding hidden and useful knowledge, commonly known as exploratory data analysis. Data mining is useful for predicting future trends, recognizing customer patterns, helping to make decisions, quickly detecting fraud and choosing the correct algorithms. Tableau works nicely for data mining.
  • Model Building: This goes further than simple data mining and requires building a machine learning model. The model is built by selecting a machine learning algorithm that suits the data, problem statement and available resources.
    Machine learning algorithms used by Data scientists
    There are two types of machine learning algorithms, supervised and unsupervised:
    1. Supervised: Supervised learning algorithms are used when the data is labeled. There are two types:
      • Regression: When you need to predict continuous values, the algorithms used include linear and multiple regression (suited to variables that are linearly related), decision trees and random forest.
      • Classification: When you need to predict categorical values, some of the classification algorithms used are KNN, logistic regression, SVM and Naïve Bayes.
    2. Unsupervised: Unsupervised learning algorithms are used when the data is unlabeled, so there is no labeled data to learn from. There are two types:
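      • Clustering: This is the method of dividing objects into groups whose members are similar to one another and dissimilar to the members of other groups. K-Means is a commonly used clustering algorithm; PCA, strictly a dimensionality-reduction technique rather than a clustering algorithm, is often applied alongside it.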
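      • Association-rule analysis: This is used to discover interesting relations between variables. The Apriori algorithm is the classic choice; the Hidden Markov Model, strictly a sequence model rather than an association-rule algorithm, is sometimes mentioned alongside it.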
  • Model Maintenance: After gathering data and performing the mining and model building, data scientists must maintain the model’s accuracy. Thus, they take the following steps:
    1. Assess: Run a sample of data through the model occasionally to make sure it remains accurate.
    2. Retrain: When the results of the assessment aren’t right, the data scientist must retrain the algorithm to provide the correct results again.
    3. Rebuild: If retraining fails, rebuilding must occur.
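
The data transformation and outlier-handling steps above can be sketched in a few lines. This is a minimal illustration using only Python's standard library, not the workflow of any tool named in this tutorial. The function names, the sample `balances` and the 1.5 × IQR rule of thumb for flagging outliers are all assumptions made for the example.

```python
from statistics import mean, pstdev, quantiles


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_normalize(values):
    """Center values on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR beyond the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]


# Hypothetical account balances; the last one is deliberately extreme.
balances = [10, 20, 30, 40, 50, 500]

scaled = min_max_normalize(balances[:5])
standardized = z_score_normalize(balances[:5])
outliers = iqr_outliers(balances)
```

Here `scaled` runs from 0 to 1, `standardized` is centered on 0, and `outliers` contains only the extreme balance. Whether a flagged value is an error or a genuine signal (fraud, for instance) is still a judgment the data scientist has to make.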

As you can see, data science is a complex, multi-step process that takes massive effort to achieve consistently excellent results.

Now that you understand what a data scientist does, let’s look at a few examples of data science at work in the next section of the data science tutorial.

Data Science in Action: Two Examples

Data science uses raw data to help solve problems. In each of these two cases, data helped answer a question plaguing people. In the first, a bank needed to understand why customers were leaving; this example focuses on data mining using Tableau. In the second, curiosity existed about which countries had the highest happiness rates; this example focuses on model building. Without data science, the answers couldn’t be found.

Example One: Customer Exit Rate at a Bank

[Figure: Data preparation, data cleaning for the bank example]

Here, a bank is doing a bit of data cleaning using Python. The data scientist loads a CSV file and discovers missing values in some columns, such as the geography field. In this case, the data scientist needs to fill in the empty values with something to even out the data set, so a piece of code fills them in with the “mean” value. Otherwise, statistical calculations on the data won’t work.

[Figure: Data cleaning, empty string in the geography column]

A data scientist can take other steps when data is missing, however. For example, one could drop the entire row – but that’s quite drastic and may skew the results of the study.

[Figure: Data cleaning, data sets]

If all the columns in a row are empty, though, one can drop that row. In addition, when 10 to 20 rows exist, and five to seven are blank, one can drop the five to seven without worrying that the results will change much.
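
The cleaning choices described above (fill a gap, or drop an empty row) can be sketched as follows. This is a dependency-free illustration, not the bank's actual code: the column names, the sample rows in `RAW_CSV` and the variable names are made up for the example. It also assumes one detail the text leaves open, namely that a text column such as geography, which has no mean, is filled with its most frequent value instead.

```python
import csv
import io
from statistics import mean, mode

# Hypothetical extract of a customer file; blanks mark missing values.
RAW_CSV = """CustomerId,Geography,CreditScore,Exited
1,France,600,0
2,,700,1
3,Spain,,0
4,France,650,0
,,,
"""

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))

# Drop rows in which every column is empty.
rows = [r for r in rows if any(value.strip() for value in r.values())]

# Fill a numeric gap with the mean of the values that are present.
known_scores = [float(r["CreditScore"]) for r in rows if r["CreditScore"]]
mean_score = mean(known_scores)
for r in rows:
    r["CreditScore"] = float(r["CreditScore"]) if r["CreditScore"] else mean_score

# Assumption: a text column is filled with its most frequent value (the mode).
known_countries = [r["Geography"] for r in rows if r["Geography"]]
most_common_country = mode(known_countries)
for r in rows:
    r["Geography"] = r["Geography"] or most_common_country
```

With a library such as pandas the same steps are usually one-liners, but the logic is the same: decide per column how a gap should be filled, and drop only what carries no information.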

After the data is cleaned, the data scientist is ready to use the data for data mining.

Data Mining using Tableau

Now, the data scientist uses Tableau to look at the exit rate of the bank’s customers based on gender, credit card holding and geography to see if these are affecting that rate.

Tableau uses a drag-and-drop system to analyze data. To analyze gender first, the data scientist puts “Exited” into the “Dimensions” section of Tableau and “Gender” into its “Measures” section.

[Figure: Data mining using Tableau, identifying fields]

This creates two columns, one for males and one for females, and two values, 0 for those who didn’t exit and 1 for those who did.

[Figure: Excel sheet of customers who exited]

Then, a bar graph shows the percentages of the values. The data reveals a difference between females and males.

[Figure: Bar chart of customers who exited]

Doing the same for credit cards shows no impact, but geography, like gender, does show an impact.

[Figure: Data mining using Tableau, exit rate on the basis of geography]
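
For readers without Tableau, the same kind of comparison can be approximated in plain Python. The sketch below is only an illustration of the idea: the `customers` records are invented for the example (they are not the bank's data), and `exit_rate_by` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical customer records; Exited is 1 for customers who left, 0 otherwise.
customers = [
    {"Gender": "Female", "Geography": "Germany", "Exited": 1},
    {"Gender": "Female", "Geography": "France", "Exited": 0},
    {"Gender": "Female", "Geography": "Germany", "Exited": 1},
    {"Gender": "Male", "Geography": "France", "Exited": 0},
    {"Gender": "Male", "Geography": "Spain", "Exited": 0},
    {"Gender": "Male", "Geography": "Germany", "Exited": 1},
]


def exit_rate_by(records, field):
    """Return {group: share of customers in that group who exited}."""
    totals = defaultdict(int)
    exits = defaultdict(int)
    for record in records:
        totals[record[field]] += 1
        exits[record[field]] += record["Exited"]
    return {group: exits[group] / totals[group] for group in totals}


by_gender = exit_rate_by(customers, "Gender")
by_geography = exit_rate_by(customers, "Geography")
```

Comparing the rates in `by_gender` or `by_geography` plays the role of reading the bar heights in the chart: a field whose groups have clearly different exit rates is one worth investigating further.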

As a result, the study shows that the bank should consider the gender and location of its customers when analyzing how it can better retain them. Thanks to data science, then, the bank learns important information about client behavior.

Example Two: Predicting World Happiness

Predicting world happiness sounds like an impossible goal, no? Thanks to data science, it’s not! Rather, using multiple linear regression model building, it’s possible to assess it. Let’s see how.

To do this, one must first choose the variables. In this case, they are happiness rank, happiness score, country, region, economy, family, health, freedom, trust, generosity and dystopia residual. Not all need to be used, but some must be to make and train the model.

Using Python, the data scientist imports libraries such as pandas, NumPy and scikit-learn (sklearn). Data is imported as CSV files from the years 2015, 2016 and 2017. Next, the scientist can concatenate the three data sets or build one model for each CSV. Because the files are ordered by rank, calling head() then shows the countries with the highest happiness scores.

Plots and graphs generated in Python show which countries are the happiest and which are less happy. A scatterplot of happiness rank against happiness score shows that the two are inversely correlated. Further plots confirm that they convey the same message, so the happiness rank can be dropped.
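
The concatenation step and the inverse relationship between rank and score can be checked numerically as well as visually. The following is a small sketch using only the standard library. The `(country, rank, score)` tuples are made-up stand-ins for the three yearly CSV files, and `pearson` is a hand-written helper rather than a call into pandas or NumPy.

```python
from math import sqrt

# Hypothetical (country, rank, score) rows standing in for the yearly files.
data_2015 = [("A", 1, 7.5), ("B", 2, 7.0), ("C", 3, 6.0)]
data_2016 = [("A", 1, 7.4), ("C", 2, 6.5), ("B", 3, 6.1)]
data_2017 = [("B", 1, 7.6), ("A", 2, 7.2), ("C", 3, 5.9)]

# Concatenate the three years into one data set.
combined = data_2015 + data_2016 + data_2017


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


ranks = [rank for _, rank, _ in combined]
scores = [score for _, _, score in combined]
r = pearson(ranks, scores)  # negative: a better (lower) rank means a higher score

# The highest-scoring rows, similar to calling head() on data sorted by score.
top_three = sorted(combined, key=lambda row: row[2], reverse=True)[:3]
```

A coefficient close to -1 says that rank and score carry nearly the same information, which is the reason for dropping one of them before building the model.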

Once the data is processed, it’s possible to remove the country names and plot the most important factors that determine world happiness. The top one, as you might imagine, is the happiness score. From the analysis, the second most important element is the economy, followed by family and health. Thanks to the highly detailed workings of Python’s multiple linear regression model building, we can now predict world happiness! Hoorah!
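
To make "multiple linear regression" a little less mysterious, here is a from-scratch sketch of what such a model does: it finds the intercept and one weight per factor that best reproduce the known scores, then uses them to predict new ones. In practice a library such as scikit-learn would do the fitting. Everything below is hypothetical: the helper names, the feature values and the coefficients (2.0, 1.5, 1.0, 0.5) used to generate the scores are invented so the result can be checked, and they are not findings from the happiness data.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_regression(features, targets):
    """Least-squares fit; returns [intercept, coef_1, ..., coef_k]."""
    design = [[1.0] + list(row) for row in features]  # leading 1.0 = intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights, row):
    """Apply the fitted intercept and coefficients to one row of features."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))


# Hypothetical (economy, family, health) values for five countries.
features = [
    (1.0, 1.2, 0.9),
    (0.5, 0.8, 0.6),
    (1.4, 1.0, 0.7),
    (0.2, 0.5, 0.3),
    (0.9, 1.3, 0.8),
]
# Scores generated from a known formula so the fit can be verified.
targets = [2.0 + 1.5 * e + 1.0 * f + 0.5 * h for e, f, h in features]

weights = fit_linear_regression(features, targets)
estimate = predict(weights, (1.1, 1.1, 0.8))
```

Because the sample scores were generated from a known formula, the fitted `weights` recover that formula up to rounding error. Real data is noisier, so the coefficients only approximate each factor's influence, and the model should be evaluated on data it has not seen.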

As we’ve shown, then, data science with Python can help achieve even the loftiest sounding data analysis.

Learn More About Data Science

We know – we’ve piqued your interest, right? You’re undoubtedly eager to learn more after reading this data science tutorial. No worries! Simplilearn offers many courses in Data Science, such as a Data Scientist Master’s Program, an Integrated Program in Big Data and Data Science and Data Science with Python Training Course. Take a look at all our Data Science programs and choose the one(s) that are right for you. You’ll be a data master in no time.

About the Author

Simplilearn

Simplilearn is one of the world’s leading providers of online training for Digital Marketing, Cloud Computing, Project Management, Data Science, IT, Software Development, and many other emerging technologies. Based in San Francisco, California, and Bangalore, India, Simplilearn has helped more than 500,000 students, professionals and companies across 200 countries get trained, upskilled, and acquire certifications.
