Python is the world's most popular programming language for Data Science, and for a very good reason. It is simple, powerful, and has tools that can handle everything from cleaning messy data to building smart AI models. This tutorial is written in easy, school-level language so that anyone, whether you are in Class 10, a college student, or a working professional starting fresh, can follow along without confusion. By the end of this tutorial, you will understand how Python works, how to use it for data analysis, how to visualize data, and how to build your first Machine Learning model.

What is Data Science?

Before we write a single line of code, let us understand what Data Science actually is.

Data Science is the process of collecting, cleaning, analyzing, and understanding large amounts of data to find patterns, make decisions, and predict future outcomes. Think of it like being a detective. A detective collects clues (data), connects the dots (analysis), and then arrives at a conclusion (insight or prediction).

For example, a school principal can use Data Science to find out which students are likely to fail and help them before the exams. A shopkeeper can use it to figure out which products sell best during the winter season. A hospital can use it to predict which patients need urgent care.

Data Science is used everywhere today, including in entertainment (Netflix recommending movies), in finance (banks detecting fraud), in healthcare (diagnosing diseases), in sports (predicting match outcomes), and in education (personalizing learning paths).

Why Python for Data Science?

There are many programming languages in the world, such as R, Java, C++, and Julia. But Python has become the clear winner for Data Science. Here is why:

1. It reads like English. Python's syntax is so clean and simple that even a beginner can understand what a piece of code is doing just by reading it. Compare this to languages like Java, which require a lot of extra code just to print "Hello World."

2. It has powerful libraries. A library is like a pre-built toolbox. Instead of writing thousands of lines of code yourself, you can simply import a library and use its ready-made tools. NumPy handles math, Pandas handles tables, Matplotlib draws charts, and Scikit-Learn builds machine learning models. (A short taster of these libraries appears right after this list.)

3. It has a massive community. Millions of developers, data scientists, and researchers use Python. This means whenever you are stuck, you can find a solution on Stack Overflow, YouTube, or GitHub within minutes.

4. It is free and open-source. You do not have to pay a single rupee to use Python or any of its Data Science libraries.

5. It runs everywhere. Python works on Windows, Mac, and Linux. You can even run it in your browser using Google Colab.
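To give you a taste of how little code this takes, here is a tiny sketch that touches three of the libraries mentioned above (Scikit-Learn appears later in the tutorial). Do not worry about understanding it yet:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

marks = np.array([85, 92, 78, 95])     # NumPy: fast math on arrays
print(marks.mean())                    # 87.5

df = pd.DataFrame({"marks": marks})    # Pandas: tabular data
print(df.describe())                   # summary statistics

df["marks"].plot(kind="bar")           # Matplotlib chart (via Pandas)
plt.show()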

Setting Up Your Python Environment

Let us set up the tools you need before writing any code.

Option 1: Install Anaconda 

Anaconda is a free software package that installs Python along with all the major Data Science libraries in one go. It also includes the Jupyter Notebook, which is the best tool for writing Data Science code.

Here is how to set it up:

  1. Go to anaconda.com in your browser
  2. Click on "Download" and choose your operating system (Windows, Mac, or Linux)
  3. Run the installer and follow the on-screen instructions
  4. Once installed, open Anaconda Navigator
  5. Click on Jupyter Notebook to launch it
  6. A browser window will open. Click "New" and then "Python 3" to start coding

 

Option 2: Use Google Colab (No Installation Required)

If you do not want to install anything on your computer, Google Colab is the perfect option. It is a free, cloud-based tool by Google that lets you write and run Python code directly in your browser.

  1. Go to colab.research.google.com
  2. Sign in with your Google account
  3. Click "New Notebook"
  4. Start writing and running Python code immediately

Google Colab is especially great because it gives you free access to powerful computers (GPU and TPU) for running heavy machine learning models.

Installing Individual Libraries

If you already have Python installed and just need specific libraries, you can install them using the pip command in your terminal:

pip install numpy pandas matplotlib seaborn scikit-learn
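To confirm the installation worked, a quick sanity check is to import each library and print its version number:

import numpy, pandas, sklearn
print(numpy.__version__)
print(pandas.__version__)
print(sklearn.__version__)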

 

Python Basics: The Building Blocks

Think of Python like a language. Just like English has words, grammar, and sentences, Python has values, rules, and statements. Let us learn all the basics.

Printing Output

The very first thing you learn in any programming language is how to display something on the screen. In Python, you use the print() function.

print("Hello, World!") print("Welcome to Python for Data Science!")

Output:

Hello, World! 

Welcome to Python for Data Science!

Simple, right? You just write what you want to display inside the parentheses with quotes.

Variables

A variable is like a labeled container or a box. You put a value inside it and give it a name. Whenever you need that value, you just use the name.


student_name = "Pooja" 
student_age = 16 
student_marks = 92.5 
is_pass = True 

print(student_name) #Pooja 
print(student_age) #16 
print(student_marks) #92.5 
print(is_pass) #True

Notice that you do not need to say what type of value you are storing. Python figures it out automatically. This is called dynamic typing.
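Here is dynamic typing in action. The same name can even be reused for a different type later (allowed, though usually a bad idea):

x = 10          # x is an int
print(type(x))  # <class 'int'>
x = "ten"       # now x is a str
print(type(x))  # <class 'str'>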

Data Types

Python has several built-in data types. These are the most important ones for Data Science:

Integer (int): Whole numbers without decimals

age = 25
score = 100

Float (float): Numbers with decimal points

temperature = 36.6
gpa = 3.75

String (str): Text, words, or sentences (always inside quotes)

city = "Delhi" message = "Python is amazing!"

Boolean (bool): Only two values, True or False

is_enrolled = True
has_graduated = False

Type Checking and Conversion:


print(type(age))    # <class 'int'> 
print(type(gpa))    # <class 'float'> 
print(type(city))   # <class 'str'> 

# Converting types 
print(float(age)) # 25.0 
print(int(3.99)) # 3 (decimal part is cut, not rounded) 
print(str(100)) # "100"

Arithmetic Operators

Python can do all kinds of math:


a = 20 
b = 6 
print(a + b) # Addition: 26 
print(a - b) # Subtraction: 14 
print(a * b) # Multiplication: 120 
print(a / b) # Division: 3.3333... 
print(a // b) # Floor Division (no decimal): 3 
print(a % b) # Modulus (remainder): 2 
print(a ** 2) # Power (a squared): 400

Comparison Operators

These return either True or False. You will use them in conditions and filters.


print(10 > 5) # True 
print(10 < 5) # False 
print(10 == 10) # True (double equals checks equality) 
print(10 != 5) # True (not equal) 
print(10 >= 10) # True 
print(10 <= 9) # False

Logical Operators


x = 85 
print(x > 70 and x < 90) # True (both conditions must be true) 
print(x > 90 or x > 80) # True (at least one must be true) 
print(not(x > 90)) # True (flips the result)

Data Structures: Storing Multiple Values

In real Data Science projects, you never work with just one value. You work with hundreds or thousands of values at the same time. Python has special structures to store all of them.

Lists

A list stores multiple values in a single variable. Values are stored in order and can be changed.


marks = [85, 92, 78, 95, 88] 
names = ["Alice", "Bob", "Charlie", "David", "Eve"] 
print(marks[0]) # 85 (first item, indexing starts at 0)
print(marks[-1]) # 88 (last item using negative index)
print(marks[1:4]) # [92, 78, 95] (slicing: items 1 to 3)

Useful list methods:


marks.append(91) # Add 91 to the end 
marks.remove(78) # Remove the value 78 
marks.sort() # Sort in ascending order 
marks.reverse() # Reverse the list 
print(len(marks)) # Count total items 
print(sum(marks)) # Add all items 
print(min(marks)) # Find minimum 
print(max(marks)) # Find maximum

Tuples

A tuple is exactly like a list, but once created, you cannot change, add, or remove its values. It is immutable (unchangeable). Use tuples for data that should remain fixed.


dimensions = (1920, 1080) # Screen resolution 
coordinates = (28.61, 77.20)    # Delhi's coordinates

print(dimensions[0])    # 1920 
print(coordinates[1]) # 77.20

Dictionaries

A dictionary stores data as key-value pairs. Think of a real dictionary where every word (key) has a meaning (value). This is extremely useful for storing structured records.


student = {
    "name": "Rohan",
    "age": 19,
    "course": "Data Science",
    "marks": 88
}

print(student["name"])          # Rohan
print(student["marks"])         # 88

student["marks"] = 93           # Update a value
student["city"] = "Delhi"       # Add a new key-value pair
del student["age"]              # Delete a key

print(student.keys())           # All keys
print(student.values())         # All values
print(student.items())          # All key-value pairs

Sets

A set stores only unique values. It automatically removes any duplicates. Sets are useful for finding unique items in a dataset.


numbers = {1, 2, 3, 3, 4, 4, 5}
print(numbers)    # {1, 2, 3, 4, 5} (duplicates removed)

set_a = {1, 2, 3, 4}
set_b = {3, 4, 5, 6}
print(set_a & set_b)    # Intersection: {3, 4}
print(set_a | set_b)    # Union: {1, 2, 3, 4, 5, 6}
print(set_a - set_b)    # Difference: {1, 2}

Control Flow: Making Decisions and Repeating Tasks

If-Elif-Else Statements

Conditional statements let your program make decisions. Think of it as teaching Python: "If this is true, do this. Otherwise, do that."


score = 83

if score >= 90:
    print("Grade: A - Excellent!")
elif score >= 75:
    print("Grade: B - Good Job!")
elif score >= 60:
    print("Grade: C - Keep Improving!")
elif score >= 40:
    print("Grade: D - Needs Attention!")
else:
    print("Grade: F - Please Revisit the Concepts!")

# Output: Grade: B - Good Job!

For Loops

A for loop repeats a block of code for every item in a sequence. This is extremely useful when working with large datasets.

subjects = ["Maths", "Science", "English", "Hindi", "Computer"]
for subject in subjects:
    print("Today's class:", subject)

Looping with range():

The range() function generates a sequence of numbers, which is very useful for running a loop a specific number of times.


for i in range(1, 6):
    print("Step", i)
# Output: Step 1, Step 2, Step 3, Step 4, Step 5

for i in range(0, 20, 5):
    print(i)
# Output: 0, 5, 10, 15 (step size of 5)

Looping through a list with an index:

fruits = ["Apple", "Banana", "Mango"]
for index, fruit in enumerate(fruits):
    print(index, "->", fruit)
# Output:
# 0 -> Apple
# 1 -> Banana
# 2 -> Mango

While Loops

A while loop keeps running as long as a condition is True. It is like a machine that keeps working until you press stop.

countdown = 5
while countdown > 0:
    print("Countdown:", countdown)
    countdown -= 1
print("Launch!")

Break and Continue

  • break stops the loop completely
  • continue skips the current step and moves to the next one

 

for num in range(1, 10):
    if num == 5:
        break         # Stop the loop when num is 5
    print(num)
# Output: 1, 2, 3, 4

for num in range(1, 10):
    if num % 2 == 0:
        continue      # Skip even numbers
    print(num)
# Output: 1, 3, 5, 7, 9

List Comprehension

This is a clever Python shortcut to create a list in just one line. It replaces 3-4 lines of a for loop.

# Regular method
squares = []
for x in range(1, 6):
    squares.append(x ** 2)

# List comprehension method (same result in one line)
squares = [x**2 for x in range(1, 6)]
print(squares)    # [1, 4, 9, 16, 25]

# With condition
even_squares = [x**2 for x in range(1, 11) if x % 2 == 0]
print(even_squares)    # [4, 16, 36, 64, 100]

Functions: Write Once, Use Many Times

A function is a named block of code that performs a specific task. You write it once and call it whenever you need it. This saves time and avoids repetition.

Defining and Calling Functions

def greet_student(name):
    print(f"Hello, {name}! Welcome to The IoT Academy.")

greet_student("Namrata")
greet_student("Pooja")

# Output:
# Hello, Namrata! Welcome to The IoT Academy.
# Hello, Pooja! Welcome to The IoT Academy.

Functions that Return Values

def calculate_percentage(marks_obtained, total_marks):
    percentage = (marks_obtained / total_marks) * 100
    return percentage

result = calculate_percentage(450, 500)
print("Percentage:", result, "%")
# Output: Percentage: 90.0 %

Default Parameters

You can set a default value for a parameter so that it works even if the user does not provide that value.

def greet(name, language="English"):
    if language == "English":
        print("Hello,", name)
    elif language == "Hindi":
        print("Namaste,", name)

greet("Ravi")               # Uses default language
greet("Anjali", "Hindi")    # Overrides with Hindi

*args and **kwargs

These are used when you do not know how many arguments will be passed to your function.

def add_all(*numbers):
    return sum(numbers)

print(add_all(1, 2, 3))           # 6
print(add_all(10, 20, 30, 40))    # 100
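The heading also promises **kwargs, which collects any number of named arguments into a dictionary. A minimal sketch (describe_student is just an invented example):

def describe_student(**details):
    for key, value in details.items():
        print(key, ":", value)

describe_student(name="Asha", course="Data Science")
# Output:
# name : Asha
# course : Data Science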

 

Lambda Functions

A lambda function is a small, anonymous function written in one line. It is very popular in Data Science for quick operations on data.

double = lambda x: x * 2
square = lambda x: x ** 2
add    = lambda a, b: a + b

print(double(5))     # 10
print(square(4))     # 16
print(add(3, 7))     # 10
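Those "quick operations on data" usually happen inside Pandas. A small sketch, assuming a DataFrame with a marks column, that applies a lambda to every value in that column:

import pandas as pd

df = pd.DataFrame({"marks": [85, 58, 92]})
df["result"] = df["marks"].apply(lambda m: "Pass" if m >= 60 else "Fail")
print(df)
#    marks result
# 0     85   Pass
# 1     58   Fail
# 2     92   Pass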

 

Python for Exploratory Data Analysis (EDA)

EDA is the step where you understand your data deeply before building any model. Think of it as reading a textbook thoroughly before attempting an exam. A good EDA helps you spot problems, understand patterns, and decide what to do next.

Here is a complete EDA workflow using the famous Titanic dataset:


import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: Load the data
df = sns.load_dataset("titanic")

# Step 2: Basic Overview
print("Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nData Types and Nulls:")
df.info()    # info() prints its summary directly, no print() needed

# Step 3: Statistical Summary
print("\nStatistical Summary:")
print(df.describe())

# Step 4: Missing Values
print("\nMissing Values:")
print(df.isnull().sum())

# Step 5: Target Variable Distribution
print("\nSurvival Rate:")
print(df["survived"].value_counts())
sns.countplot(x="survived", data=df)
plt.title("Survival Count (0=Died, 1=Survived)")
plt.show()

# Step 6: How class affected survival
sns.barplot(x="pclass", y="survived", data=df)
plt.title("Survival Rate by Passenger Class")
plt.show()

# Step 7: Age distribution
df["age"].hist(bins=30, color="skyblue", edgecolor="black")
plt.title("Age Distribution of Titanic Passengers")
plt.xlabel("Age")
plt.show()

# Step 8: Correlation Heatmap
numeric_df = df.select_dtypes(include="number")
sns.heatmap(numeric_df.corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()

Data Preprocessing: Cleaning and Preparing Data for Models

Before feeding data into any Machine Learning model, you must preprocess it. Raw data is almost always messy, inconsistent, and incomplete. Preprocessing turns it into a clean, ready-to-use format.

Step 1: Handle Missing Values


# Drop rows with missing target variable
df.dropna(subset=["survived"], inplace=True)

# Fill missing age with median
df["age"].fillna(df["age"].median(), inplace=True)

# Fill missing embarked with mode (most common value)
df["embarked"].fillna(df["embarked"].mode()[0], inplace=True)

Step 2: Encode Categorical Variables

Machine Learning models only understand numbers. So you need to convert text categories into numbers.

Label Encoding: 

Gives each category a number. It is meant for categories with a natural order, but it is also commonly used for binary columns such as sex, as below.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["sex_encoded"] = le.fit_transform(df["sex"])
# female -> 0, male -> 1

 

One-Hot Encoding: 

Creates a new column for each category (used when categories have no order)

df = pd.get_dummies(df, columns=["embarked"], drop_first=True)
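To see what one-hot encoding actually produces, here is a tiny before-and-after sketch on a made-up three-row column:

import pandas as pd

demo = pd.DataFrame({"embarked": ["S", "C", "Q"]})
print(pd.get_dummies(demo, columns=["embarked"], drop_first=True))
# Two indicator columns remain: embarked_Q and embarked_S.
# "C" is the dropped baseline: a row with both indicators off means C.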

 

Step 3: Feature Scaling (Normalization and Standardization)

When features have very different scales (e.g., Age ranges from 1-80 but Fare ranges from 0-500), it can confuse the model. Scaling brings them to the same range.

Standardization (Z-score scaling): 

The mean becomes 0 and the standard deviation becomes 1: z = (x - mean) / standard deviation

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[["age", "fare"]] = scaler.fit_transform(df[["age", "fare"]])
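A quick sanity check: after standardization, both columns should have a mean of approximately 0 and a standard deviation of approximately 1:

print(df[["age", "fare"]].mean())    # ~0 for both columns
print(df[["age", "fare"]].std())     # ~1 for both columns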

 

Normalization (Min-Max Scaling): 

All values are squeezed between 0 and 1: x_scaled = (x - min) / (max - min)

from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler()
df[["age", "fare"]] = minmax.fit_transform(df[["age", "fare"]])

 

Step 4: Feature Selection

Not all columns are useful for prediction. Select only the relevant features:

features = ["pclass", "sex_encoded", "age", "fare", "embarked_Q", "embarked_S"]X = df[features]y = df["survived"]

 

Machine Learning with Scikit-Learn

Machine Learning is the most exciting part of Data Science. It is the process of training a computer to make predictions using patterns it finds in data. Scikit-Learn (sklearn) is Python's most popular and beginner-friendly ML library.

The Universal ML Workflow

No matter which algorithm you use, the steps are always the same:

  1. Prepare the data (clean, encode, scale)
  2. Split into training and testing sets
  3. Choose a model
  4. Train (fit) the model on training data
  5. Test (predict) on testing data
  6. Evaluate the model's accuracy

 

Train-Test Split


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,        # 20% for testing, 80% for training
    random_state=42       # For reproducibility
)

print("Training samples:", len(X_train))
print("Testing samples:", len(X_test))

Linear Regression (Predicting a Continuous Number)

Linear Regression draws the best straight line through data points to predict a continuous value, for example a student's exam score based on hours of study. The code below reuses the Titanic split purely to demonstrate the API; since its target is 0 or 1, the classification models in the next sections are the proper tool for it, and a more natural regression sketch follows after the code.


from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

model = LinearRegression()
model.fit(X_train, y_train)             # Train the model

predictions = model.predict(X_test)    # Make predictions

print("MSE:", mean_squared_error(y_test, predictions))
print("R2 Score:", r2_score(y_test, predictions))
# R2 Score of 1.0 means perfect prediction. Closer to 1 is better.
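To mirror the study-hours example in words, here is a tiny self-contained sketch with invented numbers (hours studied vs. exam score):

import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([[1], [2], [3], [4], [5], [6]])   # feature must be 2-D
scores = np.array([35, 48, 55, 63, 71, 80])        # invented exam scores

reg = LinearRegression()
reg.fit(hours, scores)
print(reg.predict([[7]]))   # predicted score for 7 hours of study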

Logistic Regression (Predicting Yes or No)

Despite the name, Logistic Regression is used for classification (predicting categories, not numbers). For example, predicting whether a passenger survived (1) or not (0).


from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

model = LogisticRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, predictions))
print("\nClassification Report:")
print(classification_report(y_test, predictions))

# Confusion Matrix
cm = confusion_matrix(y_test, predictions)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

Decision Tree Classifier

A Decision Tree makes decisions by asking yes/no questions about the features, like a flowchart. It is very easy to understand and explain.


from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("Decision Tree Accuracy:", accuracy_score(y_test, predictions))

Random Forest Classifier

A Random Forest builds many Decision Trees (100 in the code below) and combines their predictions by voting. This makes it much more powerful and accurate than a single tree.


from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, predictions))

# Feature importance
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values(ascending=False).plot(kind="bar", color="teal")
plt.title("Feature Importance")
plt.show()

K-Nearest Neighbors (KNN)

KNN classifies a new data point based on the k most similar points in the training data. Think of it as: "You are judged by the company you keep."

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("KNN Accuracy:", accuracy_score(y_test, predictions))

Support Vector Machine (SVM)

SVM finds the best boundary (called a hyperplane) that separates different classes with maximum margin.

from sklearn.svm import SVC
model = SVC(kernel="rbf", random_state=42)model.fit(X_train, y_train)predictions = model.predict(X_test)
print("SVM Accuracy:", accuracy_score(y_test, predictions))

Model Evaluation Metrics

Choosing the right metric to evaluate your model is just as important as building the model itself.

  • Accuracy: for balanced classes. Formula: Correct Predictions / Total Predictions
  • Precision: when false positives are costly. Formula: True Positives / (True Positives + False Positives)
  • Recall: when false negatives are costly. Formula: True Positives / (True Positives + False Negatives)
  • F1-Score: for imbalanced datasets. Formula: 2 * (Precision * Recall) / (Precision + Recall)
  • R2 Score: for regression problems. Measures how well the model fits the data
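A quick worked example to make the formulas concrete (the counts are invented): suppose your model produces 40 true positives, 10 false positives, and 20 false negatives. Then precision = 40 / (40 + 10) = 0.80, recall = 40 / (40 + 20) ≈ 0.67, and F1 = 2 * (0.80 * 0.67) / (0.80 + 0.67) ≈ 0.73.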

 

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print("Accuracy:", accuracy_score(y_test, predictions))print("Precision:", precision_score(y_test, predictions))print("Recall:", recall_score(y_test, predictions))print("F1 Score:", f1_score(y_test, predictions))

Cross Validation (Testing Your Model More Reliably)

Instead of testing your model just once, cross-validation tests it multiple times on different parts of the data and averages the results. This gives you a much more reliable estimate of your model's performance.

from sklearn.model_selection import cross_val_score
model = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print("Cross-Validation Scores:", scores)
print("Mean Accuracy:", scores.mean())
print("Standard Deviation:", scores.std())

Hyperparameter Tuning

Every model has settings called hyperparameters that you can adjust to make it perform better. GridSearchCV automatically tries all combinations and finds the best one.

from sklearn.model_selection import GridSearchCV
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None]
}

grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Accuracy:", grid_search.best_score_)

 

Building a Complete End-to-End Data Science Project

Now let us put everything together into one complete project using the Titanic dataset:


import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report

# 1. Load Data
df = sns.load_dataset("titanic")

# 2. EDA
print(df.head())
df.info()    # prints dtypes and non-null counts directly
print(df.isnull().sum())

# 3. Data Cleaning
df = df[["survived", "pclass", "sex", "age", "fare", "embarked"]].copy()
df["age"].fillna(df["age"].median(), inplace=True)
df["embarked"].fillna(df["embarked"].mode()[0], inplace=True)

# 4. Encoding
le = LabelEncoder()
df["sex"] = le.fit_transform(df["sex"])
df = pd.get_dummies(df, columns=["embarked"], drop_first=True)

# 5. Feature Selection
X = df.drop("survived", axis=1)
y = df["survived"]

# 6. Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 7. Train Model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 8. Evaluate
predictions = model.predict(X_test)
print("\nAccuracy:", accuracy_score(y_test, predictions))
print("\nClassification Report:")
print(classification_report(y_test, predictions))

# 9. Visualize Feature Importance
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values(ascending=False).plot(kind="bar", color="steelblue")
plt.title("Feature Importance - Titanic Survival")
plt.tight_layout()
plt.show()

Important Python Libraries Cheat Sheet

Here is your complete reference guide for Data Science libraries in Python:

  • NumPy: numerical computing. Key uses: arrays, math operations, random numbers
  • Pandas: data manipulation. Key uses: DataFrames, CSV loading, cleaning, groupby
  • Matplotlib: basic visualization. Key uses: line, bar, scatter, pie charts
  • Seaborn: statistical visualization. Key uses: heatmaps, box plots, pair plots
  • Scikit-Learn: machine learning. Key uses: regression, classification, preprocessing
  • SciPy: scientific computing. Key uses: statistics, linear algebra, optimization
  • Statsmodels: statistical modeling. Key uses: hypothesis tests, regression analysis
  • Plotly: interactive charts. Key uses: web-based dynamic visualizations

Your Step-by-Step Learning Roadmap

Follow this roadmap in order, and you will go from a complete beginner to a confident Data Science practitioner:

  1. Week 1-2: Python Basics (variables, loops, functions, data structures)
  2. Week 3: NumPy (arrays, math, slicing, vectorized operations)
  3. Week 4: Pandas (DataFrames, CSV, cleaning, filtering, groupby)
  4. Week 5: Matplotlib and Seaborn (all chart types)
  5. Week 6: EDA on real datasets (Titanic, Iris, Tips)
  6. Week 7: Data Preprocessing (encoding, scaling, handling missing values)
  7. Week 8-10: Machine Learning with Scikit-Learn (Linear Regression, Logistic Regression, Decision Tree, Random Forest)
  8. Week 11-12: End-to-End Projects on Kaggle

Where to Practice and Find Datasets

The best way to get better at Data Science is to work on real problems every day:

  • Kaggle (kaggle.com): Thousands of free datasets, competitions, and free notebooks
  • UCI Machine Learning Repository (archive.ics.uci.edu): Classic benchmark datasets
  • Google Dataset Search (datasetsearch.research.google.com): A Google search engine for datasets
  • Seaborn built-in datasets: tips, titanic, iris, diamonds, penguins
  • Scikit-Learn built-in datasets: load_iris(), load_diabetes(), load_wine() (note that load_boston() was removed from recent scikit-learn versions); see the loading sketch just below
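For example, loading one of the scikit-learn datasets into a Pandas DataFrame takes only a couple of lines (a minimal sketch using load_iris):

from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris(as_frame=True)    # as_frame returns pandas objects
df = iris.frame                    # features plus the target column
print(df.head())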

Conclusion

Python is not just a programming language. It is a gateway to one of the most exciting and fastest-growing fields in the world today. Throughout this tutorial, you have walked through every important step of the Data Science journey, from setting up your Python environment to writing your first Machine Learning model. You learned how to store and manipulate data using NumPy and Pandas, how to turn raw numbers into beautiful and meaningful charts using Matplotlib and Seaborn, and how to clean, preprocess, and prepare data for powerful predictions using Scikit-Learn.

The best part about Python for Data Science is that you do not need to be a genius or a math expert to get started. All you need is curiosity, consistency, and the willingness to practice every single day. Every expert data scientist you see today was once a complete beginner who did not know what a variable was.

Now it is your turn. Pick a dataset from Kaggle, open your Jupyter Notebook or Google Colab, and start applying everything you have learned here. Build projects, make mistakes, fix them, and build again. Each project you complete will teach you more than any tutorial ever can.

Data Science with Python is a skill that opens doors to incredible career opportunities, better decision-making, and the ability to solve real-world problems with confidence. Your journey starts today.