Getting Started With SQL in Data Science

SQL is one of the most important tools in data science. Whether you are a student, a fresher, or someone who just wants to understand how data works, this tutorial will teach you SQL from zero to hero - in the simplest words possible.

Think of SQL like a language you use to talk to a database. Just like you use English to ask your teacher a question, you use SQL to ask a database for data. By the end of this tutorial, you will know how to store data, retrieve it, filter it, sort it, combine it, and even analyze it for real data science projects.

What is SQL?

SQL stands for Structured Query Language. It is a programming language designed specifically to manage and retrieve data from databases. SQL is used by data scientists, data analysts, software developers, and business intelligence teams all around the world. It is not just a skill - it is a superpower for anyone working with data.

Here is a simple analogy. Imagine a huge school library with thousands of books. Now imagine you want to find all books written by a specific author. You walk up to the librarian and say, "Please give me all books by Author X." In SQL, that conversation is replaced by a simple command. SQL is that librarian - fast, precise, and always available.

SQL is not case-sensitive, which means writing SELECT and select both work the same way. However, by convention, SQL keywords are always written in UPPERCASE to make the code easier to read.

Why SQL is Important in Data Science

Before going deeper into how to write SQL, it is important to understand why data scientists need SQL. In real-world data science jobs, data is almost never in a neat Excel file. It is stored in large databases - sometimes containing millions or billions of rows. SQL is the standard tool to access and work with this data efficiently.

Here is why every data scientist must know SQL:

Data lives in databases. Most companies store their customer, sales, and operational data in relational databases.
SQL handles big data well. SQL databases are optimized to run queries across millions of records in seconds.
It is easy to learn. SQL reads almost like English, making it one of the most beginner-friendly programming tools.
It is widely used. SQL is consistently ranked as one of the top skills required in data science job listings.
It works everywhere. SQL works in MySQL, PostgreSQL, SQL Server, Oracle, SQLite, and many more platforms.

Understanding Databases and Tables

Before writing your first SQL query, you need to understand two things: databases and tables.

A database is like a big folder on a computer that holds all the data of a company or application. Inside that folder, data is organized into tables. A table is just like a spreadsheet - it has rows and columns.

For example, a school database might have:

A table called Students (with columns like Name, Age, Grade, City)
A table called Marks (with columns like StudentID, Subject, Score)
A table called Teachers (with columns like TeacherID, Name, Subject)

Each row in a table is one record. Each column is one type of information. Together, rows and columns make up the data.

The type of database SQL works with is called a Relational Database. "Relational" means the tables in the database can be connected to each other using common columns (called keys). For example, the Students table and the Marks table can be connected using the StudentID column.

Setting Up Your SQL Environment

To practice SQL, you need a tool where you can write and run queries. Here are the most popular options:

MySQL - Free, widely used, great for beginners. Download from [mysql.com].
PostgreSQL - Free and very powerful. Great for data science use cases.
SQLite - Lightweight, no installation needed, works inside Python.
DB Fiddle / SQLiteOnline - Online tools where you can practice SQL in your browser with zero installation.

For beginners, the fastest way to start is to go to sqliteonline.com or sqlfiddle.com and start writing queries right in your browser.

SQL Command Types

SQL commands are divided into five main categories. Think of them as five different jobs SQL can do:

Category	Full Name	What It Does	Example Commands
DDL	Data Definition Language	Creates and changes the database structure	CREATE, ALTER, DROP, TRUNCATE
DML	Data Manipulation Language	Adds, updates, and deletes data	INSERT, UPDATE, DELETE
DQL	Data Query Language	Retrieves data	SELECT
DCL	Data Control Language	Controls access to data	GRANT, REVOKE
TCL	Transaction Control Language	Manages transactions	COMMIT, ROLLBACK, SAVEPOINT

For data science, you will mostly use DQL (SELECT) and some DML commands. Let's now learn each one step by step.

Creating Your First Database and Table (DDL)

Let's start from the very beginning - creating a database and a table.

CREATE DATABASE

CREATE DATABASE school_db;

This one line creates a brand new database called school_db. It is like creating a new folder on your computer.

USE (Select a Database)

Before you can work inside a database, you need to tell SQL which one to use:

USE school_db;

CREATE TABLE

Now let's create a Students table inside this database:


CREATE TABLE Students (
    StudentID   INT PRIMARY KEY,
    Name        VARCHAR(100),
    Age         INT,
    City        VARCHAR(50),
    Grade       VARCHAR(10)
);

Let's understand each part:

StudentID INT PRIMARY KEY - This is a number column. PRIMARY KEY means every student has a unique ID. No two students can have the same ID.
VARCHAR(100) - This stores text up to 100 characters.
INT - This stores whole numbers.

SQL Data Types

Data types tell SQL what kind of data a column will hold. Here are the most common ones:

Data Type	What It Stores	Example
INT	Whole numbers	1, 25, 1000
VARCHAR(n)	Text up to n characters	"Rahul", "Delhi"
FLOAT / DECIMAL	Decimal numbers	9.5, 23.75
DATE	Date values	2024-04-01
BOOLEAN	True or False values	TRUE, FALSE

Inserting Data (DML – INSERT)

Now that the table exists, let's fill it with data using the INSERT INTO command.


INSERT INTO Students (StudentID, Name, Age, City, Grade)
VALUES (1, 'Rahul Sharma', 17, 'Delhi', 'A');

INSERT INTO Students (StudentID, Name, Age, City, Grade)
VALUES (2, 'Priya Verma', 16, 'Mumbai', 'B');

INSERT INTO Students (StudentID, Name, Age, City, Grade)
VALUES (3, 'Aman Singh', 18, 'Bangalore', 'A');

INSERT INTO Students (StudentID, Name, Age, City, Grade)
VALUES (4, 'Sneha Gupta', 17, 'Delhi', 'C');

INSERT INTO Students (StudentID, Name, Age, City, Grade)
VALUES (5, 'Rohit Jain', 16, 'Chennai', 'B');

Each INSERT statement adds one new row of data into the table. Think of it as filling one new row in a register.

Retrieving Data (DQL – SELECT)

The SELECT statement is the most important SQL command in data science. It is how you ask the database to show you data.

Select All Columns

SELECT * FROM Students;

The * means "all columns." This will show you all rows and all columns from the Students table.

Output:

StudentID	Name	Age	City	Grade
1	Rahul Sharma	17	Delhi	A
2	Priya Verma	16	Mumbai	B
3	Aman Singh	18	Bangalore	A
4	Sneha Gupta	17	Delhi	C
5	Rohit Jain	16	Chennai	B

Select Specific Columns

SELECT Name, City FROM Students;

This shows only the Name and City columns. In data science, you often only need a few columns, not all of them.

DISTINCT – Remove Duplicates

SELECT DISTINCT City FROM Students;

DISTINCT removes duplicate values. In the output above, Delhi appears twice. Using DISTINCT, it will appear only once.

Output:

CityDelhiMumbaiBangaloreChennai

Filtering Data with WHERE

The WHERE clause lets you filter rows based on a condition. It is like saying,"Show me only the students who live in Delhi."

SELECT * FROM Students WHERE City = 'Delhi';

Output:

StudentID	Name	Age	City	Grade
1	Rahul Sharma	17	Delhi	A
4	Sneha Gupta	17	Delhi	C

Comparison Operators in WHERE

You can use many types of conditions in WHERE:

Operator	Meaning	Example
=	Equal to	City = 'Delhi'
!= or <>	Not equal to	Grade != 'C'
>	Greater than	Age > 16
<	Less than	Age < 18
>=	Greater than or equal to	Age >= 17
<=	Less than or equal to	Age <= 17

AND, OR, NOT

You can combine multiple conditions:

-- Students from Delhi with Grade ASELECT * FROM StudentsWHERE City = 'Delhi' AND Grade = 'A';
-- Students from Delhi or MumbaiSELECT * FROM StudentsWHERE City = 'Delhi' OR City = 'Mumbai';
-- Students not from DelhiSELECT * FROM StudentsWHERE NOT City = 'Delhi';

IN Operator

IN is a shortcut for checking multiple values at once:

SELECT * FROM StudentsWHERE City IN ('Delhi', 'Mumbai', 'Chennai');

BETWEEN Operator

BETWEEN filters values within a range:

SELECT * FROM StudentsWHERE Age BETWEEN 16 AND 17;

LIKE Operator

LIKE is used for pattern matching in text. It uses two wildcards:

% means zero or more characters
_ means exactly one character


-- Find students whose name starts with 'R' 
SELECT * FROM Students 
WHERE Name LIKE 'R%'; 

-- Find students whose name has exactly 5 characters 
SELECT * FROM Students 
WHERE Name LIKE '_____';

Sorting Data with ORDER BY

ORDER BY sorts your results. You can sort in ascending (A to Z, 1 to 100) or descending order (Z to A, 100 to 1).


-- Sort by Age from youngest to oldest (ASC is default)
SELECT * FROM Students
ORDER BY Age ASC;

-- Sort by Name in reverse alphabetical order
SELECT * FROM Students
ORDER BY Name DESC;

Limiting Results with LIMIT

In real data science, tables can have millions of rows. You don't want to load all of them every time. LIMIT lets you see only a few rows:


-- Show only the first 3 students
SELECT * FROM Students
LIMIT 3;

This is especially useful when you are exploring a new dataset and want to quickly see the first few rows - similar to df.head() in Python Pandas.

Updating and Deleting Data (DML)

UPDATE – Change Existing Data

The UPDATE command changes the value of an existing row:

UPDATE Students SET Grade = 'A' WHERE StudentID = 4;

This changes Sneha's grade from C to A. Always use WHERE with UPDATE, or else it will change ALL rows!

DELETE – Remove Rows

DELETE FROM Students WHERE StudentID = 5;

This removes Rohit's record. Again, always use WHERE with DELETE, or it will delete the entire table's data.

Aggregate Functions

Aggregate functions are used to calculate values across multiple rows. These are extremely important in data science for summarizing and analyzing data.

The five main aggregate functions are:

Function	What It Does	Example
COUNT()	Counts the number of rows	COUNT()
SUM()	Adds up all values in a column	SUM(Score)
AVG()	Calculates the average	AVG(Age)
MAX()	Finds the highest value	MAX(Score)
MIN()	Finds the lowest value	MIN(Age)

Let's create a Marks table and use these functions:


CREATE TABLE Marks (
    MarkID    INT PRIMARY KEY,
    StudentID INT,
    Subject   VARCHAR(50),
    Score     INT
);

INSERT INTO Marks VALUES (1, 1, 'Math', 85);
INSERT INTO Marks VALUES (2, 2, 'Math', 72);
INSERT INTO Marks VALUES (3, 3, 'Math', 91);
INSERT INTO Marks VALUES (4, 4, 'Math', 65);
INSERT INTO Marks VALUES (5, 5, 'Math', 78);

Now use aggregate functions:

-- Total number of recordsSELECT COUNT(*) AS TotalStudents FROM Marks;
-- Average score in MathSELECT AVG(Score) AS AverageScore FROM Marks;
-- Highest and lowest scoreSELECT MAX(Score) AS Highest, MIN(Score) AS Lowest FROM Marks;
-- Total of all scoresSELECT SUM(Score) AS TotalScore FROM Marks;

The keyword AS gives the output column a custom name. This is called an alias. For example, COUNT(*) AS TotalStudents names the output column "TotalStudents."

GROUP BY and HAVING

GROUP BY

GROUP BY groups rows that have the same value and then applies an aggregate function on each group. Think of it as organizing data into categories and then calculating something for each category.

Let's add more data to the Marks table:

INSERT INTO Marks VALUES (6, 1, 'Science', 90);INSERT INTO Marks VALUES (7, 2, 'Science', 68);INSERT INTO Marks VALUES (8, 3, 'Science', 88);

Now find the average score per subject:

SELECT Subject, AVG(Score) AS AvgScoreFROM MarksGROUP BY Subject;

Output:

Subject	AvgScore
Math	78.2
Science	82.0

HAVING

HAVING is like WHERE, but it works after GROUP BY. You use it to filter grouped results.

-- Show only subjects where average score is above 80SELECT Subject, AVG(Score) AS AvgScoreFROM MarksGROUP BY SubjectHAVING AVG(Score) > 80;

Key rule: Use WHERE to filter rows before grouping. Use HAVING to filter groups after grouping.

SQL JOINs - Combining Tables

This is one of the most powerful features of SQL. JOINs are used to combine data from two or more tables using a common column.

Think of it like this: You have a table of students with their IDs and names, and another table of marks with student IDs and scores. A JOIN brings these two tables together so you can see each student's name along with their score.

There are four main types of JOINs:

INNER JOIN

Returns only the rows that have matching values in both tables.

SELECT Students.Name, Marks.Subject, Marks.ScoreFROM StudentsINNER JOIN Marks ON Students.StudentID = Marks.StudentID;

Output: Only students who have a mark entry will appear.

LEFT JOIN

Returns all rows from the left (first) table, and matching rows from the right table. If no match, the right side shows NULL.

SELECT Students.Name, Marks.Subject, Marks.ScoreFROM StudentsLEFT JOIN Marks ON Students.StudentID = Marks.StudentID;

Even if a student has no marks, they will still appear in the results.

RIGHT JOIN

The opposite of LEFT JOIN. Returns all rows from the right table and matching rows from the left.

SELECT Students.Name, Marks.Subject, Marks.ScoreFROM StudentsRIGHT JOIN Marks ON Students.StudentID = Marks.StudentID;

FULL OUTER JOIN

Returns all rows from both tables. Where there is no match, NULLs are shown.

SELECT Students.Name, Marks.Subject, Marks.ScoreFROM StudentsFULL OUTER JOIN Marks ON Students.StudentID = Marks.StudentID;

Visual Summary of JOINs

JOIN Type	What it Returns
INNER JOIN	Only matching rows from both tables
LEFT JOIN	All rows from the left table + matching rows from the right
RIGHT JOIN	All rows from the right table + matching rows from the left
FULL OUTER JOIN	All rows from both tables

Subqueries – Queries Inside Queries

A subquery is a query written inside another query. It is like asking a question inside a question. You put the inner query inside parentheses ().

-- Find students whose score is above the average scoreSELECT StudentID, ScoreFROM MarksWHERE Score > (SELECT AVG(Score) FROM Marks);

Here, the inner query (SELECT AVG(Score) FROM Marks) runs first and finds the average score. Then the outer query uses that average to filter results.

Subquery in FROM Clause

You can also use a subquery as a temporary table:


SELECT Name, Score
FROM Students
JOIN (
    SELECT StudentID, MAX(Score) AS Score
    FROM Marks
    GROUP BY StudentID
) AS TopScores
ON Students.StudentID = TopScores.StudentID;

This finds the highest score for each student and shows it alongside their name.

CASE Statement – IF-ELSE in SQL

The CASE statement works like an IF-ELSE condition in other programming languages. It lets you create conditional logic inside a query.


SELECT Name, Grade,
    CASE
        WHEN Grade = 'A' THEN 'Excellent'
        WHEN Grade = 'B' THEN 'Good'
        WHEN Grade = 'C' THEN 'Average'
        ELSE 'Needs Improvement'
    END AS Performance
FROM Students;

Output:

Name	Grade	Performance
Rahul Sharma	A	Excellent
Priya Verma	B	Good
Aman Singh	A	Excellent
Sneha Gupta	C	Average

In data science, CASE statements are very useful to create new categories or labels from existing data - a process called feature engineering.

SQL String Functions

SQL has built-in functions to work with text data. These are very useful when cleaning and preparing data for analysis.


-- Convert name to uppercase
SELECT UPPER(Name) FROM Students;

-- Convert name to lowercase
SELECT LOWER(Name) FROM Students;

-- Get length of the name
SELECT Name, LENGTH(Name) AS NameLength FROM Students;

-- Extract first 5 characters
SELECT SUBSTRING(Name, 1, 5) AS ShortName FROM Students;

-- Replace a word in text
SELECT REPLACE(City, 'Delhi', 'New Delhi') FROM Students;

-- Remove spaces from beginning and end
SELECT TRIM('  Rahul  ') AS CleanName;

-- Combine two columns into one
SELECT CONCAT(Name, ' - ', City) AS NameCity FROM Students;

SQL Date and Time Functions

Working with dates is a very common task in data science. Here are some useful date functions:


-- Get today's date
SELECT CURDATE();

-- Get current date and time
SELECT NOW();

-- Extract year from a date
SELECT YEAR('2024-04-01') AS YearOnly;

-- Extract month
SELECT MONTH('2024-04-01') AS MonthOnly;

-- Calculate difference between two dates
SELECT DATEDIFF('2024-12-31', '2024-01-01') AS DaysBetween;

Date functions are especially useful when analyzing time-series data, sales trends, or user activity logs.

NULL Values in SQL

A NULL value means there is no data in that cell. It is not zero. It is not an empty string. It simply means "no value."

In data science, you will often deal with missing or NULL data. Here is how to handle it in SQL:


-- Find students with no Grade assigned
SELECT * FROM Students
WHERE Grade IS NULL;

-- Find students who have a Grade
SELECT * FROM Students
WHERE Grade IS NOT NULL;

-- Replace NULL with a default value using COALESCE
SELECT Name, COALESCE(Grade, 'Not Assigned') AS Grade
FROM Students;

COALESCE returns the first non-NULL value. It is very useful for cleaning messy datasets.

Common Table Expressions (CTEs)

A CTE (Common Table Expression) is a temporary result set that you can reference within a query. It is written using the WITH keyword and makes your queries much easier to read and maintain.


WITH TopStudents AS (
    SELECT StudentID, AVG(Score) AS AvgScore
    FROM Marks
    GROUP BY StudentID
    HAVING AVG(Score) >= 80
)
SELECT Students.Name, TopStudents.AvgScore
FROM Students
JOIN TopStudents ON Students.StudentID = TopStudents.StudentID;

Think of a CTE like giving a name to a mini-query so you can reuse it. CTEs are much cleaner than writing long nested subqueries. In data science projects, CTEs are used heavily for step-by-step data transformation.

Window Functions - Advanced Data Analysis

Window functions are one of the most powerful SQL features for data science. Unlike aggregate functions that collapse all rows into one result, window functions calculate values across rows while keeping each row separate.

You use window functions with the OVER() clause:


-- Rank students by score within each subject
SELECT
    StudentID,
    Subject,
    Score,
    RANK() OVER (PARTITION BY Subject ORDER BY Score DESC) AS SubjectRank
FROM Marks;

Here, PARTITION BY Subject creates a separate window for each subject, and ORDER BY Score DESC ranks within that window.

Common Window Functions

Function	What It Does
ROW_NUMBER()	Assigns a unique sequential number to each row
RANK()	Ranks rows; ties get the same rank with gaps
DENSE_RANK()	Ranks rows; ties get the same rank without gaps
SUM() OVER()	Running total across rows
AVG() OVER()	Moving average across rows
LAG()	Access the value from the previous row
LEAD()	Access the value from the next row

Running Total Example

SELECT StudentID, Score, SUM(Score) OVER (ORDER BY StudentID) AS RunningTotalFROM Marks;

This gives a running (cumulative) total of scores. This is very useful in financial data analysis and time-series analytics.

Views - Saving Queries as Virtual Tables

A VIEW is a saved SQL query that you can treat like a real table. Once you create a view, you can query it anytime without writing the original query again.


-- Create a view that shows student names with their scores
CREATE VIEW StudentScores AS
SELECT Students.Name, Marks.Subject, Marks.Score
FROM Students
INNER JOIN Marks ON Students.StudentID = Marks.StudentID;

-- Use the view like a table
SELECT * FROM StudentScores
WHERE Score > 80;

Views are useful in data science when you have a complex join or transformation that you need to use repeatedly. Instead of writing the same 20-line query every day, you just call the view.

Indexes - Making SQL Faster

When a table has millions of rows, searching through all of them is slow. An INDEX is a special data structure that speeds up search queries.

-- Create an index on the City columnCREATE INDEX idx_city ON Students(City);

Now when you run WHERE City = 'Delhi', SQL will use the index to find results much faster - just like an index at the back of a textbook.

When to use an index:

On columns you frequently use in WHERE clauses
On columns used in JOIN conditions
On columns used in ORDER BY

When not to over-index:

Too many indexes slow down INSERT, UPDATE, and DELETE operations
Index only the columns you really need

SQL for Real Data Science Projects

Now that you know all the key SQL concepts, let's see how they come together in a real data science workflow.

Step 1 – Understand the Data

Before writing any query, explore the tables:


-- See all tables in the database
SHOW TABLES;

-- See the structure of a table
DESCRIBE Students;

-- Quick preview of the data
SELECT * FROM Students LIMIT 5;

Step 2 – Data Profiling

Check for missing values, duplicates, and data quality issues:


-- Count total rows
SELECT COUNT(*) FROM Students;

-- Count rows with NULL City
SELECT COUNT(*) FROM Students WHERE City IS NULL;

-- Check for duplicate names
SELECT Name, COUNT(*) AS Count
FROM Students
GROUP BY Name
HAVING COUNT(*) > 1;

Step 3 - Data Cleaning

Fix or handle bad data:


-- Replace NULLs in Grade with 'Unknown'
UPDATE Students SET Grade = 'Unknown' WHERE Grade IS NULL;

-- Delete duplicate rows (keep the one with lower ID)
DELETE FROM Students
WHERE StudentID NOT IN (
    SELECT MIN(StudentID)
    FROM Students
    GROUP BY Name, City
);

Step 4 - Exploratory Data Analysis (EDA)

Use SQL to answer business questions:


-- How many students are from each city?
SELECT City, COUNT(*) AS StudentCount
FROM Students
GROUP BY City
ORDER BY StudentCount DESC;

-- What is the average score per grade?
SELECT s.Grade, AVG(m.Score) AS AvgScore
FROM Students s
JOIN Marks m ON s.StudentID = m.StudentID
GROUP BY s.Grade
ORDER BY AvgScore DESC;

Step 5 – Feature Engineering

Create new meaningful columns for your analysis:


SELECT
    StudentID,
    Score,
    CASE
        WHEN Score >= 90 THEN 'High Performer'
        WHEN Score >= 75 THEN 'Average Performer'
        ELSE 'Low Performer'
    END AS PerformanceCategory
FROM Marks;

SQL vs Python for Data Science

Many beginners ask: "Should I use SQL or Python (Pandas) for data science?" The answer is: both are needed, and they work together.

Task	SQL	Python (Pandas)
Querying large databases	✅ Better	❌ Slower
Data cleaning	✅ Good	✅ Better
Machine learning models	❌ Not for this	✅ Better
Visualization	❌ Limited	✅ Better
Joins and aggregations	✅ Excellent	✅ Good
Working with billions of rows	✅ Built for this	❌ Struggles

In a typical data science project, you use SQL to extract and prepare the data from a database, and then use Python or R to model, analyze, and visualize it.

SQL Best Practices for Data Scientists

Following good habits makes your SQL code cleaner, faster, and easier to understand.

Always use WHERE with UPDATE and DELETE - Without it, you will accidentally change or delete all rows.
Use aliases - SELECT AVG(Score) AS AvgScore is much clearer than just AVG(Score).
Avoid SELECT * in production - Always specify the columns you need. It is faster and more intentional.
Use CTEs for complex queries - CTEs are much easier to read than deeply nested subqueries.
Comment your code - Use -- to add comments and explain what your query does.
Test on small data first - Use LIMIT to preview results before running a query on millions of rows.
Understand your data before querying - Use DESCRIBE and SELECT * LIMIT 5 to explore before diving deep.
Use indexes on large tables - Indexes make filtering and joining on large tables dramatically faster.

SQL Learning Roadmap

Here is a clear roadmap to go from beginner to advanced in SQL for data science:

Level 1 – Beginner (Week 1-2)

Understand databases and tables
Learn SELECT, WHERE, ORDER BY, LIMIT
Practice INSERT, UPDATE, DELETE
Learn data types and constraints

Level 2 – Intermediate (Week 3-4)

Master aggregate functions (COUNT, SUM, AVG, MAX, MIN)
Learn GROUP BY and HAVING
Understand all four types of JOINs
Practice LIKE, IN, BETWEEN, NULL handling

Level 3 – Advanced (Week 5-6)

Write subqueries and nested queries
Master CTEs (WITH clause)
Learn window functions (RANK, ROW_NUMBER, LAG, LEAD)
Understand views and indexes
Practice on real datasets (Kaggle, public databases)

Level 4 – Expert (Week 7+)

Stored procedures and triggers
Performance optimization and execution plans
Working with large datasets and partitioned tables
Connecting SQL with Python (using pandas.read_sql() or SQLAlchemy)

Practice Problems

The best way to learn SQL is by practicing. Here are some exercises you can try on the Students and Marks tables:

Find all students who scored more than 80 marks.
Find the student with the highest average score across all subjects.
Count how many students are in each grade category (A, B, C).
Find all students who have NOT submitted any marks (use LEFT JOIN).
Display each student's name along with their rank based on their total score.
Find the top 2 scoring students in each subject using window functions.
Calculate the percentage of students from each city out of the total.
Find students whose name starts with 'R' and score is between 70 and 90.

You can practice these on SQLZoo, DataLemur, or SQLBolt for free.

Quick Reference: Most Used SQL Commands

Here is a handy cheat sheet of the most important SQL commands in data science:


-- Select data
SELECT column1, column2 FROM table_name;

-- Filter data
WHERE condition;

-- Sort data
ORDER BY column ASC/DESC;

-- Limit rows
LIMIT n;

-- Remove duplicates
SELECT DISTINCT column FROM table;

-- Aggregate functions
COUNT(*), SUM(col), AVG(col), MAX(col), MIN(col)

-- Group data
GROUP BY column HAVING condition;

-- Join tables
INNER JOIN table2 ON table1.id = table2.id;

-- Subquery
WHERE column > (SELECT AVG(column) FROM table);

-- CTE
WITH cte_name AS (SELECT ...) SELECT * FROM cte_name;

-- Window function
RANK() OVER (PARTITION BY col ORDER BY col2 DESC)

-- Create view
CREATE VIEW view_name AS SELECT ...;

-- Handle NULLs
COALESCE(column, 'default')

-- Conditional logic
CASE WHEN ... THEN ... ELSE ... END

Conclusion

SQL is not just a tool - it is the language of data. Every data scientist, no matter how advanced, uses SQL daily to explore, clean, and prepare data. It is simple enough for beginners to learn in a week, yet powerful enough for experts to use on billion-row datasets. Start with the basics, practice on real data, and keep building complexity step by step. The more you practice SQL, the more confident and capable you will become as a data scientist.

E&ICT Academy, IIT Roorkee Programs