
If you are preparing for a career in data analytics, mastering SQL (Structured Query Language) is absolutely essential. SQL is the backbone of data analysis, enabling analysts to query, manipulate, and manage large datasets across industries. In interviews at top companies like Google, Meta, Microsoft, and Amazon, recruiters often evaluate candidates on both theoretical SQL concepts (such as normalization, keys, and joins) and practical SQL skills (such as filtering, aggregations, subqueries, and window functions).
To help you succeed, we’ve compiled the top 50 SQL interview questions and detailed answers for Data Analyst freshers. This comprehensive guide will boost your confidence, strengthen your SQL fundamentals, and ensure you’re ready to tackle even the toughest interview challenges.
SQL Interview Questions and Answers for Data Analyst Freshers
- What is SQL, and what is its purpose in data analysis?
Answer: SQL (Structured Query Language) is the standard language used to communicate with and manage relational databases. It allows users to create, read, update, and delete data by writing queries in a declarative syntax. In data analysis, SQL is essential because it enables analysts to efficiently retrieve specific information from large datasets, join multiple tables to gather related data, and perform aggregations (like sums or averages) to derive insights. By using SQL, data analysts can filter data to find relevant records, calculate summary statistics, and prepare data for reporting or visualization. In essence, SQL provides a powerful, consistent way to interact with structured data and is crucial for extracting meaningful insights from databases in a reliable manner.
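To make this concrete, here is a minimal sketch of the kind of query an analyst might run day to day. The `orders` table and its `region`, `status`, and `amount` columns are hypothetical:

```sql
-- Hypothetical example: total revenue per region for completed orders,
-- assuming an orders(region, status, amount) table exists.
SELECT region,
       COUNT(*)    AS order_count,
       SUM(amount) AS total_revenue
FROM orders
WHERE status = 'completed'
GROUP BY region
ORDER BY total_revenue DESC;
```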
- What is the difference between SQL and NoSQL databases?
Answer: SQL databases are relational databases that store data in structured tables with predefined schemas (table structures). They use SQL for defining and manipulating data, and they excel at ensuring data consistency and enforcing relationships (keys) between tables. SQL databases are ideal for structured data and complex queries, often supporting ACID properties for reliable transactions. In contrast, NoSQL databases are non-relational and can store unstructured or semi-structured data without a fixed schema. NoSQL systems (like document stores, key-value stores, column-family stores, or graph databases) offer flexibility and horizontal scalability – they handle large volumes of data and high throughput by distributing data across many servers. However, NoSQL databases often sacrifice some strict consistency in favor of eventual consistency to achieve this scalability. In summary, SQL databases emphasize structured data and complex querying with strong consistency, while NoSQL databases provide flexibility and scalability for varied data types, making them suitable for big data and real-time web applications.
- What are the different types of SQL statements (DDL, DML, DCL, TCL)?
Answer: SQL commands are grouped into categories based on their purpose:
- DDL (Data Definition Language): Used to define or modify the schema of the database. DDL commands create or alter the structure of database objects. For example, `CREATE`, `ALTER`, and `DROP` (e.g., `CREATE TABLE ...`, `ALTER TABLE ...`) are DDL commands that define new tables or change existing ones.
- DML (Data Manipulation Language): Used for data manipulation within tables. These commands query or change the data itself. Examples include `SELECT` (to retrieve data), `INSERT` (to add new rows), `UPDATE` (to modify existing rows), and `DELETE` (to remove rows).
- DCL (Data Control Language): Used to control access to data in the database. DCL commands manage permissions and security. For example, `GRANT` (to give a user access rights to database objects) and `REVOKE` (to remove access rights) are DCL commands.
- TCL (Transaction Control Language): Used to manage transactions in the database. These commands ensure the integrity of groups of operations. Examples: `COMMIT` (to save all changes made in the current transaction), `ROLLBACK` (to undo changes in the current transaction), and `SAVEPOINT` (to set a save point within a transaction). Each of these ensures that sequences of operations either complete fully or have no effect, maintaining database consistency.
(There is also DQL – Data Query Language – which mainly refers to the SELECT statement, but in many classifications SELECT is included under DML.)
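As a quick illustration, here is one statement from each category. The table and role names are hypothetical, and DCL syntax in particular varies by database:

```sql
-- DDL: define structure
CREATE TABLE employees (id INT PRIMARY KEY, name VARCHAR(100));
-- DML: manipulate data
INSERT INTO employees (id, name) VALUES (1, 'Asha');
-- DCL: control access (role name is hypothetical; syntax varies by DBMS)
GRANT SELECT ON employees TO analyst_role;
-- TCL: manage transactions
COMMIT;
```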
- Explain Primary Key and Foreign Key in the context of a relational database.
Answer: A Primary Key is a column (or a set of columns) in a table that uniquely identifies each row of that table. Primary keys enforce entity integrity by ensuring no two rows have the same primary key value, and a primary key value cannot be NULL. For example, a table `Employees` might use an `employee_id` as its primary key so that each employee record can be uniquely retrieved. Primary keys are often indexed to allow fast lookup of records.
A Foreign Key is a column (or set of columns) in one table that establishes a link to a Primary Key in another table. In other words, a foreign key in table `A` refers to the primary key of table `B`, creating a parent-child relationship between the two tables. This enforces referential integrity: you cannot have a foreign key value in the child table that doesn't exist as a primary key in the parent table. For example, an `Orders` table might have a foreign key `customer_id` that references the `Customers` table's primary key `customer_id`. This relationship ensures that every order is associated with a valid customer.
Differences: The primary key is about uniquely identifying records within its own table, while a foreign key is about ensuring a relationship between tables. A table can have only one primary key (though it may be composite, involving multiple columns), but it can have multiple foreign keys linking to different parent tables. Together, primary and foreign keys help maintain a consistent and structured relational database design by linking records and preventing invalid data (e.g., orders for non-existent customers).
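A minimal schema sketch tying both concepts together (table and column names are illustrative):

```sql
-- Hypothetical schema illustrating the keys described above.
CREATE TABLE Customers (
    customer_id INT PRIMARY KEY,        -- primary key: unique, non-NULL
    name        VARCHAR(100) NOT NULL
);

CREATE TABLE Orders (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL,
    order_date  DATE,
    -- referential integrity: every order must point at a real customer
    FOREIGN KEY (customer_id) REFERENCES Customers(customer_id)
);
```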
- How do you retrieve all columns from a table named `sales`?
Answer: To retrieve all columns from a table in SQL, you use the `SELECT *` syntax. The asterisk (`*`) is a wildcard that represents all columns. For example, the query to get all columns and all rows from a table named `sales` would be: `SELECT * FROM sales;` This statement will return every column for every record in the `sales` table. While `SELECT *` is convenient, in practice it's often recommended to specify column names in production queries (e.g., `SELECT column1, column2 FROM sales;`) to avoid fetching unnecessary data and to make the result set and code more explicit.
- Write a SQL query to select distinct values from a column named `product_category`.
Answer: To get distinct (unique) values from a specific column, use the `DISTINCT` keyword in the SELECT statement. For example, if you want unique `product_category` values from the `sales` table, you would query: `SELECT DISTINCT product_category FROM sales;` This query scans the `product_category` column and returns each unique value only once. If duplicates exist in that column, they will be filtered out in the results. (Note: You can select distinct combinations of multiple columns by listing them together after `SELECT DISTINCT`.)
- How do you filter records in SQL? Provide an example using a WHERE clause.
Answer: In SQL, the `WHERE` clause is used to filter records based on specified conditions. It limits the rows returned by a query to those that meet the condition. For example, suppose we have a `sales` table and we want to find all sales with an amount greater than 1000. We would write: `SELECT * FROM sales WHERE amount > 1000;` This query will return only the rows where the `amount` column value is greater than 1000. Multiple conditions can be combined using `AND` (for when all conditions must be true) or `OR` (for when at least one condition is true). For instance, `SELECT * FROM sales WHERE amount > 1000 AND region = 'North';` would retrieve sales records from the North region with amount over 1000. The `WHERE` clause is fundamental for filtering data in SELECT, UPDATE, and DELETE statements.
- Explain the `GROUP BY` clause with an example.
Answer: The `GROUP BY` clause in SQL is used to arrange rows that have the same values in specified columns into aggregate groups. It's commonly used in conjunction with aggregate functions (`COUNT`, `SUM`, `AVG`, `MAX`, `MIN`, etc.) to perform calculations on each group of rows. Essentially, `GROUP BY` collates rows by one or more column values, so that aggregate functions apply to each group rather than the whole result set.
Example: Suppose we have a `sales` table with columns `product_category` and `amount`, and we want to calculate the total sales amount for each product category. We would write: `SELECT product_category, SUM(amount) AS total_sales FROM sales GROUP BY product_category;` This query groups the rows by each unique `product_category` and then computes the sum of `amount` for each group. The result will have one row per product category, with the corresponding total sales. Without `GROUP BY`, the `SUM(amount)` in the SELECT would be calculated over all rows, but with `GROUP BY product_category`, the sum is reset for each category group.
- What is the purpose of the `HAVING` clause, and how is it different from `WHERE`?
Answer: The `HAVING` clause is used to filter groups created by a `GROUP BY` clause. It is similar to `WHERE`, but `WHERE` filters individual rows before grouping, whereas `HAVING` filters after the aggregation has been performed. In other words, `WHERE` applies conditions to columns in raw data (no aggregates allowed in the `WHERE` clause), while `HAVING` applies conditions to aggregated values (the results of functions like SUM, COUNT, etc.).
Example: Consider a table `sales(product_category, amount)`. If we want to find product categories that have more than $5000 in total sales, we would use a `HAVING` clause: `SELECT product_category, SUM(amount) AS total_sales FROM sales GROUP BY product_category HAVING SUM(amount) > 5000;` Here, `SUM(amount) > 5000` is an aggregate condition, so it belongs in HAVING (not in WHERE). The database first groups the sales by category, then computes the sum per category, and finally filters out groups where the sum is not greater than 5000.
If we instead wanted to filter individual rows (for example, consider only sales with amount > 100), that condition would go in the `WHERE` clause (e.g., `WHERE amount > 100`) before the grouping step. In summary, use `WHERE` to filter raw rows and `HAVING` to filter aggregated results.
- How do you sort the results of a query in SQL?
Answer: You sort query results using the `ORDER BY` clause. `ORDER BY` allows you to specify one or more columns (or expressions) to sort by, and you can choose ascending order (`ASC`) or descending order (`DESC`). By default, sorting is in ascending order (smallest to largest values, or A to Z for text) if no order is specified.
Example: If we want to retrieve all sales and sort them by amount from highest to lowest, we would write: `SELECT * FROM sales ORDER BY amount DESC;` This will list the `sales` records in descending order of the `amount` field. We could also sort by multiple columns. For instance, `ORDER BY region ASC, amount DESC` would sort first by `region` alphabetically, and for rows with the same region, order them by `amount` from highest to lowest. Sorting is often useful for reports or when combined with a `LIMIT` clause to get things like "Top N" results.
Answer: An SQL alias is a temporary name given to a table or column for the duration of a query. Aliases are used to make query results more readable or to shorten long table names (especially in queries involving multiple tables). You create an alias using the `AS` keyword (although `AS` is optional for column aliases in many databases).
- Column Alias: Renames a column in the result set. For example: `SELECT first_name AS "First Name", last_name AS "Last Name" FROM employees;` Here, the output will display the `first_name` column under the heading "First Name" and `last_name` as "Last Name". The aliases don't change the actual column names in the table; they only label the output. (Note: Quoting the alias with double quotes or brackets is only necessary if it contains spaces or special characters.)
- Table Alias: Renames a table within a query, often to shorten references in joins or subqueries. For example: `SELECT e.name, d.department_name FROM employees AS e JOIN departments AS d ON e.department_id = d.id;` Here, `employees` is aliased as `e` and `departments` as `d`. This lets us refer to those tables succinctly (e.g., `e.name` instead of `employees.name`). Table aliases are especially useful when the same table is used more than once (self-join) or when table names are long.
In summary, aliases improve query clarity and are only valid for the duration of that SQL statement.
- What are the different types of JOINs in SQL? Explain each briefly.
Answer: Joins are used to combine rows from two or more tables based on related columns between them. There are several types of SQL joins, each serving a different purpose:
- INNER JOIN: Returns only the matching rows from both tables. In an inner join, a row will appear in the result set only if there is a match in all tables being joined. For example, joining `Orders` and `Customers` on `customer_id` via an inner join will give you only those orders that have a corresponding customer record (and vice versa). Rows with no match in the other table are excluded.
- LEFT JOIN (LEFT OUTER JOIN): Returns all rows from the left table (the one mentioned before `LEFT JOIN`), and the matching rows from the right table. If there's no match in the right table, the result will still include the left table's row, but with NULLs for the right table's columns. (Think of this as "take all rows from the first table, and add data from the second table when available.")
- RIGHT JOIN (RIGHT OUTER JOIN): The mirror image of a left join. It returns all rows from the right table, and matched rows from the left table, or NULLs for the left table's columns when there's no match. Right joins are less common (since you can usually swap table order and use a left join), but conceptually it's the same idea from the opposite side.
- FULL JOIN (FULL OUTER JOIN): Returns all rows when there is a match in either table. In a full outer join, every row from both tables will appear at least once in the result. Where matches exist, the rows are combined. Where a row in one table has no match in the other, that row still appears, with NULLs for the missing values from the other table. This join is useful to find all records across two datasets, matching what can be matched, and retaining non-matched records from both sides.
- CROSS JOIN: Returns the Cartesian product of the two tables (every row of the first table paired with every row of the second table). A cross join does not require a matching key and will produce `m * n` rows (if the first table has m rows and the second has n). It's rarely used deliberately, except for specific scenarios (like generating combinations or adding a constant set of values), since it can produce a very large result set if tables are large.
Note: An OUTER JOIN typically refers collectively to left, right, or full joins – any join that can include non-matching rows. An INNER JOIN by contrast includes only matching rows. All these joins are specified in the FROM clause, for example: `SELECT ... FROM tableA INNER JOIN tableB ON tableA.key = tableB.key`. Proper indexing on join keys can greatly improve join performance.
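To contrast the two most common join types on the `Orders`/`Customers` example above, here is a hedged side-by-side sketch (column names are assumptions):

```sql
-- Assuming Customers(customer_id, name) and Orders(order_id, customer_id, amount):

-- INNER JOIN: only customers that actually have orders
SELECT c.name, o.amount
FROM Customers c
INNER JOIN Orders o ON o.customer_id = c.customer_id;

-- LEFT JOIN: every customer, with NULL amount where no order exists
SELECT c.name, o.amount
FROM Customers c
LEFT JOIN Orders o ON o.customer_id = c.customer_id;
```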
- What is a self-join in SQL? Give an example use case.
Answer: A self-join is a join in which a table is joined with itself. In a self-join, you treat one single table as if it were two tables by using table aliases, allowing you to compare rows within that table. Self-joins are useful for querying hierarchical or related data that exists in one table.
Example use cases:
- Hierarchical Relationships: Consider an `Employees` table where each employee row has a `manager_id` that references the `employee_id` of their manager (all within the same table). To get a list of employees alongside their manager's name, you would self-join the `Employees` table to itself: one alias representing the employee, and another alias representing the manager. For instance: `SELECT emp.name AS Employee, mgr.name AS Manager FROM Employees AS emp JOIN Employees AS mgr ON emp.manager_id = mgr.employee_id;` Here, `Employees` is used twice: `emp` is an alias for the employee, and `mgr` for the manager. The join condition matches each employee to their manager.
- Comparing Rows: Self-joins can also find duplicates or compare rows. For example, in a table of products you might join the table to itself to find pairs of products in the same category with the same price (see the sketch at the end of this answer).
In summary, a self-join isn't a special keyword but a technique where you join a table to itself to meet a query's needs. It's particularly common in hierarchical data scenarios (organizational charts, bill of materials, etc.) where rows have parent-child relationships within one table.
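Here is a minimal sketch of the comparing-rows case mentioned above, assuming a hypothetical `Products(product_id, category, price)` table:

```sql
-- Pairs of distinct products in the same category with the same price.
SELECT p1.product_id, p2.product_id, p1.category, p1.price
FROM Products p1
JOIN Products p2
  ON  p1.category = p2.category
  AND p1.price    = p2.price
  AND p1.product_id < p2.product_id;  -- avoids self-pairs and mirrored duplicates
```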
- What is a subquery in SQL, and how would you use one?
Answer: A subquery (also called an inner query or nested query) is a query embedded within another SQL query. A subquery is typically used to provide a value or set of values to the outer (main) query. You can place subqueries in various parts of an SQL statement, most commonly in the `WHERE` clause, the `FROM` clause (as a derived table), or the `SELECT` clause. The outer query will use the result of the subquery to further filter or compute results.
Example (subquery in WHERE clause): Suppose we have an `Employees` table with a `salary` column, and we want a list of employees whose salary is above the average salary of all employees. We can use a subquery to compute the average salary, then use it in the outer query's `WHERE`: `SELECT name FROM Employees WHERE salary > (SELECT AVG(salary) FROM Employees);` Here, `(SELECT AVG(salary) FROM Employees)` is the subquery, which returns a single value (the average salary). The outer query then compares each employee's salary to that value.
Example (subquery as derived table in FROM): You can select from a subquery as if it were a table, e.g., `SELECT sub.department, sub.avg_salary FROM (SELECT department, AVG(salary) AS avg_salary FROM Employees GROUP BY department) AS sub WHERE sub.avg_salary > 100000;` This finds departments with average salary over 100,000 by first computing averages in the subquery.
Subqueries can return a single value (for a scalar comparison), a list of values (for use with `IN` or `EXISTS`), or even a whole table (for use in FROM). They are a powerful tool for breaking down complex queries into logical steps or filtering based on aggregated calculations that can't be directly done in a single query clause.
- What is a correlated subquery, and how does it differ from a regular subquery?
Answer: A correlated subquery is a subquery that references one or more columns from the outer query, thereby forming a dependency between the two. Because of this reference, a correlated subquery is re-evaluated for each row of the outer query, rather than executing just once. In contrast, a regular (non-correlated) subquery has no such external references and can be executed independently of the outer query; the outer query then uses its result.
Differences:
- Execution: A non-correlated subquery runs completely on its own first, providing a result that the outer query uses (for example, a subquery in the WHERE clause might return a value or list that the outer query then compares against). A correlated subquery, however, runs repeatedly – typically once for each candidate row of the outer query – because it needs values from that outer row to compute its result.
- Syntax clue: In SQL, you can often spot a correlated subquery because it will include a reference to a table alias from the outer query. For example: `SELECT e.employee_id, e.name FROM Employees e WHERE e.salary > (SELECT AVG(ef.salary) FROM Employees ef WHERE ef.department = e.department);` Here `e.department` inside the subquery links back to the outer alias `e`. This subquery (which calculates the average salary in the same department as the outer employee) must be recomputed for each employee `e` being considered.
- Use cases: Correlated subqueries are useful for queries like "find all rows that meet some condition relative to other rows in the same table." For instance, "find employees whose salary is above the average salary of their department" (as in the example) requires a correlated subquery. A regular subquery could not directly express that without correlation.
- Performance: Because correlated subqueries execute per row, they can be less efficient than equivalent joins or non-correlated subqueries. Sometimes a correlated subquery can be rewritten as a join or an analytic (window) function for better performance.
Summary: Use a correlated subquery when the subquery's logic naturally depends on the outer query's current row. If the subquery can run independently and provide a useful result to the outer query, it's likely better as a non-correlated subquery or another construct. Recognizing this difference helps in understanding and optimizing SQL queries.
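For illustration, the department-average example above could be rewritten with a window function, which many optimizers handle more efficiently. This is a sketch assuming the same `Employees` columns:

```sql
-- Window-function rewrite of the correlated subquery:
-- compute each department's average once, then filter on it.
SELECT employee_id, name
FROM (
    SELECT employee_id, name, salary,
           AVG(salary) OVER (PARTITION BY department) AS dept_avg
    FROM Employees
) t
WHERE salary > dept_avg;
```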
- Explain the `UNION` and `UNION ALL` operators in SQL.
Answer: The `UNION` and `UNION ALL` operators are used to combine the result sets of two (or more) `SELECT` queries. They stack one query's results on top of the other's, aligning columns. The key difference is how they handle duplicate results:
- `UNION`: Performs a distinct union of the two sets. This means it combines the results and then removes any duplicate rows, so that each row in the final output is unique. Because it must eliminate duplicates, `UNION` may involve an extra sorting or hashing step behind the scenes to compare rows from both results.
- `UNION ALL`: Performs a raw concatenation of the results, taking all rows from both queries and not removing duplicates. If the same row appears in the first and second query, it will appear twice in the combined output. `UNION ALL` is generally faster than `UNION` because it avoids the overhead of checking for and eliminating duplicates.
Requirements: Both queries combined by a UNION must select the same number of columns, and the corresponding columns should have compatible data types. The column names in the result are taken from the first query by default.
Example: `SELECT name, region FROM Customers UNION SELECT name, region FROM Prospects;` This `UNION` returns a list of unique name-region pairs that appear in either the Customers or Prospects table (eliminating any exact duplicates present in both). If we used `UNION ALL`, the result would include all rows from both tables, including any duplicates.
In summary, use `UNION` when you need a distinct set of results across multiple queries, and use `UNION ALL` when you want to preserve all occurrences (and either you know duplicates can't happen, or you don't mind them and prefer the performance gain).
- How can you find duplicate records in a table using SQL?
Answer: One common way to find duplicate records is to group the data by the columns that should be unique and use the `COUNT()` function to identify counts greater than 1. Rows that have a count > 1 for a given key are duplicates. For example, suppose we have a table `Orders(order_id, customer_id, order_date, ...)` and we suspect duplicates in the `customer_id` and `order_date` combination (meaning a customer might have unintentionally placed identical orders on the same date). We could find duplicates like so: `SELECT customer_id, order_date, COUNT(*) AS cnt FROM Orders GROUP BY customer_id, order_date HAVING COUNT(*) > 1;` This will list each `(customer_id, order_date)` pair that appears more than once, along with how many times it appears (the `cnt`). Those are duplicates according to that criterion.
If you want to see the full rows that are duplicated, you can join this result back to the original table or use a window function. For instance, using a window function approach: `SELECT * FROM (SELECT *, COUNT(*) OVER (PARTITION BY customer_id, order_date) AS cnt FROM Orders) AS sub WHERE sub.cnt > 1;` This will mark each row with a count of how many rows share its `(customer_id, order_date)`, and then filter to those with count > 1.
Alternative approach: Using a self-join – joining the table with itself on the duplicate criteria – can also find duplicates, but the GROUP BY/HAVING method is typically more straightforward.
Once duplicates are identified, you might then decide how to handle them (e.g., delete extras, archive them, etc.). The exact query will depend on what "duplicate" means in context (all columns identical, or just certain key fields identical).
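As a hedged follow-up, here is a common pattern for deleting the extra copies while keeping one row per group. This sketch deletes through a CTE with `ROW_NUMBER()`, which works in SQL Server; other databases need a different formulation (e.g., a `DELETE ... USING` or an `IN` subquery):

```sql
-- Keep the lowest order_id per (customer_id, order_date); delete the rest.
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY customer_id, order_date
                              ORDER BY order_id) AS rn
    FROM Orders
)
DELETE FROM ranked WHERE rn > 1;
```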
- What is the CASE statement in SQL? Provide an example.
Answer: The `CASE` statement in SQL is a conditional expression that allows you to perform IF-THEN-ELSE logic within a query. It evaluates conditions sequentially and returns a result when the first condition is met, or a default result if no condition is met. `CASE` is often used in the SELECT list to create computed columns, but it can also be used in ORDER BY, HAVING, etc.
Syntax: `CASE WHEN condition1 THEN result1 WHEN condition2 THEN result2 ... ELSE resultN END`
- The conditions (`WHEN` clauses) are evaluated in order. The first true condition yields its corresponding `THEN` result.
- If none of the `WHEN` conditions are true, the `ELSE` result is returned. (If `ELSE` is omitted and nothing matches, CASE returns NULL.)
Example: Suppose we have a table `Employees(name, salary)` and we want to categorize each employee as "High", "Medium", or "Low" income based on their salary. We can use a CASE in the SELECT: `SELECT name, CASE WHEN salary >= 100000 THEN 'High Income' WHEN salary >= 50000 THEN 'Medium Income' ELSE 'Low Income' END AS income_level FROM Employees;` This will output each employee's name along with an `income_level` classification. Employees with a salary of 100k or above are labeled High Income, those with 50k–99,999 are Medium, and the rest are Low Income.
CASE statements are very powerful for creating custom buckets or categories on the fly, handling complex conditional logic, or pivoting data within a query without having to write separate queries for each condition.
- How do you handle NULL values in SQL? (For example, checking for NULL or replacing NULLs)
Answer: `NULL` represents an unknown or missing value in SQL. Handling NULLs properly is important because any comparison or arithmetic with NULL generally yields NULL (or false in a WHERE condition). Here are common ways to handle them:
- Checking for NULL: Use the `IS NULL` or `IS NOT NULL` operators (instead of `=` or `!=`). For example: `SELECT * FROM Customers WHERE phone IS NULL;` finds rows where the `phone` column has no value. Similarly, `... WHERE phone IS NOT NULL` finds rows where a phone number is present.
- Conditional logic for NULL: The `CASE` statement can distinguish NULLs as well. For example, `CASE WHEN address IS NULL THEN 'Unknown' ELSE address END` could be used to label missing addresses as 'Unknown'.
- Replacing NULL with a default: Use functions like `COALESCE()` (standard SQL) or `IFNULL()`/`NVL()` depending on the database. `COALESCE(value, default)` will return `value` if it's not NULL, or `default` if `value` is NULL. For example: `SELECT COALESCE(phone, 'N/A') AS phone_number FROM Customers;` This will output the phone number if available, or `'N/A'` if the phone is NULL. You can chain `COALESCE` with multiple arguments as well; it returns the first non-NULL argument.
- Aggregates and NULL: Aggregate functions like `SUM`, `AVG`, etc., ignore NULL values. However, `COUNT(*)` counts rows regardless of NULLs, whereas `COUNT(column)` counts only non-NULL values in that column. Keep this in mind when interpreting results (a short sketch follows this answer).
- NULL-safe comparisons: Some databases offer special operators (e.g., the `<=>` operator in MySQL) to compare with NULL safely, but generally using `IS NULL` is preferred.
In summary, to check for NULL, use `IS NULL`/`IS NOT NULL`. To replace NULL with something meaningful (like 0, or 'Unknown'), use `COALESCE()` or similar functions. By understanding how NULL behaves, you can avoid bugs (like missing results due to a `WHERE` filter that inadvertently filters out NULLs) and present data more clearly (e.g., no one wants to see "null" in a report where a descriptive phrase makes more sense).
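A small sketch of the counting behavior described above, assuming a `Customers` table with a nullable `phone` column:

```sql
SELECT COUNT(*)                AS total_rows,        -- counts every row
       COUNT(phone)            AS rows_with_phone,   -- skips NULL phones
       COUNT(*) - COUNT(phone) AS rows_missing_phone
FROM Customers;
```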
- What is the `LIMIT` clause in SQL, and how is it used?
Answer: The `LIMIT` clause is used (in MySQL, PostgreSQL, SQLite, and some other SQL dialects) to restrict the number of rows returned by a query. It is often used to fetch a sample of results, implement pagination, or get the "top N" records of a certain query.
Usage: `LIMIT <number_of_rows>` at the end of a query will return at most that many rows. For example: `SELECT * FROM sales ORDER BY sale_date DESC LIMIT 10;` will return the first 10 rows from the `sales` table sorted by `sale_date` in descending order (likely giving the 10 most recent sales).
Some databases (like Oracle or MS SQL Server) don't use `LIMIT` but have alternative syntax: Oracle uses `FETCH FIRST n ROWS ONLY` or old-style `ROWNUM`, and SQL Server uses `TOP n` (placed right after SELECT, e.g., `SELECT TOP 10 * FROM ... ORDER BY ...`). But the purpose is the same – to limit output.
LIMIT with OFFSET: Many systems support `LIMIT <count> OFFSET <skip>` to skip a certain number of rows and then take the next count. For instance, `LIMIT 10 OFFSET 20` would skip the first 20 results and then return the next 10, which is useful for pagination (page 3 of results with 10 per page, for example).
Important: When using `LIMIT`, especially without an `ORDER BY`, the rows you get are not guaranteed to be in a meaningful or consistent order (SQL tables are inherently unordered). So typically, you use `ORDER BY` to control which rows are being limited (e.g., "the top 5 highest scores" requires an `ORDER BY score DESC LIMIT 5`). Without an `ORDER BY`, `LIMIT 10` will just return an arbitrary 10 rows as determined by the query plan (which could be unpredictable).
In summary, `LIMIT` is a convenient way to constrain result set size and is particularly handy for large datasets when you only need to preview data or fetch a subset.
- How would you retrieve the top 5 highest sales amounts from a `sales` table?
Answer: To get the top 5 highest values, you need to sort the data in descending order by the sales amount and then limit the results to 5. In SQL dialects that support `LIMIT`, the query would be: `SELECT amount FROM sales ORDER BY amount DESC LIMIT 5;` This will return the 5 largest `amount` values from the `sales` table. If there are multiple columns you want to retrieve (say, also the sale ID or date along with the amount), you can select those as well; just ensure the sort is by the `amount` in descending order.
If your SQL dialect doesn't support `LIMIT`, you would use an equivalent method:
- In Microsoft SQL Server or Azure SQL: `SELECT TOP 5 amount FROM sales ORDER BY amount DESC;`
- In Oracle SQL (12c and above): `SELECT amount FROM sales ORDER BY amount DESC FETCH FIRST 5 ROWS ONLY;`
- In older Oracle versions (pre-12c) without the FETCH clause, one common approach was using a subquery with `ROWNUM`.
Note: If there is a tie at 5th place (e.g., the 5th and 6th highest amounts are equal), the above query will arbitrarily pick five rows based on the sort stability. If you want all rows that tie for the 5th highest amount, you'd need a slightly different approach (like a subquery to get "amount >= some cutoff", or a ranking function as sketched below). But typically "top 5" implies any arbitrary 5 if there's a tie at the boundary, unless otherwise specified.
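If you do need the tie-friendly variant mentioned in the note, a ranking window function is one way to express it. A sketch: `RANK()` assigns equal ranks to equal amounts, so rows tied at the boundary are all kept:

```sql
SELECT amount
FROM (
    SELECT amount, RANK() OVER (ORDER BY amount DESC) AS rnk
    FROM sales
) ranked
WHERE rnk <= 5;
```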
- What is a window function in SQL? Provide an example.
Answer: A window function (also known as an analytic function) performs a calculation across a set of table rows that are somehow related to the current row, as defined by an `OVER` clause. Unlike aggregate functions, window functions do not collapse rows; they produce a result for each row while having the ability to look at data in other rows of the result set. Common window functions include ranking functions (`ROW_NUMBER()`, `RANK()`, `DENSE_RANK()`), aggregate functions used as window functions (`SUM()`, `AVG()`, etc. with an OVER clause), and others like `LAG()` and `LEAD()` for accessing neighboring row values.
A window function's syntax involves the function call followed by `OVER (...)`, where inside the parentheses you can specify:
- a PARTITION BY clause to divide rows into groups (like grouping, but without collapsing),
- an ORDER BY clause to order rows within those partitions (necessary for functions like ranking or calculating running totals), and
- possibly a window frame specification (for example, rows between unbounded preceding and current row, which is the default frame for many aggregates if an ORDER BY is present).
Example: Suppose we have a `sales` table with columns `product_id` and `amount`. If we want to calculate the total sales amount for each product and show it on each row (alongside each individual sale), a window SUM can be used: `SELECT product_id, amount, SUM(amount) OVER (PARTITION BY product_id) AS total_sales_per_product FROM sales;` This uses `SUM(amount) OVER (PARTITION BY product_id)` as a window function. It will output each sale record, and the third column will be the sum of all `amount` values for that same product_id. Unlike a GROUP BY query, this window query preserves each individual sale row. For instance, if product 101 had three sales of $50, $30, $20, each of those rows would show a `total_sales_per_product` of $100 (the sum of 50+30+20).
Another example (Ranking): If we want to rank employees by salary within each department, we could use: `SELECT department, employee_name, salary, RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank_in_dept FROM Employees;` This will list employees and give them a rank (1 = highest salary) within their department.
Window functions are powerful for running totals, moving averages, cumulative distributions, comparing a row to others (like finding the difference from the average), etc., all without losing the detail of each row.
- How do you calculate a running total (cumulative sum) in SQL?
Answer: A running total can be computed using a window function, specifically by applying `SUM()` as a window (analytic) function with an appropriate window frame. The idea is to sum values from the start up to the current row's position when ordered by some column (usually a date or an explicitly defined sequence).
Example: Suppose we have a `sales` table with `sale_date` and `amount`. To calculate a running total of sales amount over time, we can do: `SELECT sale_date, amount, SUM(amount) OVER (ORDER BY sale_date ROWS UNBOUNDED PRECEDING) AS running_total FROM sales;` This uses `ROWS UNBOUNDED PRECEDING` in the window frame to indicate that, for each row, the SUM should consider all rows from the start of the partition up to the current row (when ordered by `sale_date`). The resulting `running_total` will start at the first date's amount, then accumulate as dates progress. If two rows have the exact same timestamp, you might include a secondary ordering (or use RANGE instead of ROWS depending on needs).
Many databases allow a simplified syntax: `SUM(amount) OVER (ORDER BY sale_date)`, which by default uses an "unbounded preceding to current row" frame for cumulative sums. In SQL Server, for example, you might write `SUM(amount) OVER (ORDER BY sale_date ROWS UNBOUNDED PRECEDING) AS running_total`, and in Oracle or PostgreSQL, similar syntax works (with slight variations in default framing).
Important: Ensure the `ORDER BY` in the `OVER` clause defines the logical sequence for accumulation (e.g., by date or by an ID that reflects sequence). If you partition the data (using `PARTITION BY`), the running total resets for each partition group. For example, `SUM(amount) OVER (PARTITION BY customer_id ORDER BY sale_date)` would give a running total per customer (see the sketch at the end of this answer).
Alternative approach (older SQL): Before window functions were widely supported, people sometimes achieved running totals via self-joins or correlated subqueries (each row summing up all previous rows). But window functions make it far easier and more efficient. They handle ordering and partitioning internally and often use algorithms that are optimized for this cumulative calculation.
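For completeness, here is the per-customer variant mentioned above as a full query. This is a sketch, assuming `sales` also has a `customer_id` column:

```sql
-- Running total per customer, resetting for each customer_id.
SELECT customer_id,
       sale_date,
       amount,
       SUM(amount) OVER (PARTITION BY customer_id
                         ORDER BY sale_date) AS running_total_per_customer
FROM sales;
```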
- How is the `EXISTS` clause used in a SQL query?
Answer: The `EXISTS` clause is used in a WHERE condition to test whether a subquery returns any rows. It returns a boolean result – `TRUE` if the subquery returns at least one row, and `FALSE` if the subquery returns no rows. `EXISTS` is often used for checking the existence of related data without actually needing to return that data, and it can be more efficient than using IN or a JOIN in certain cases (especially when the goal is purely existence testing).
Syntax: `SELECT columns FROM table t WHERE EXISTS (subquery_that_references_t);` The subquery typically references the outer query's table to correlate them. The moment a single row is found that satisfies the subquery, `EXISTS` returns true and stops processing further.
Example: Suppose we have two tables, `Customers` and `Orders`, and we want to find all customers who have at least one order. We can write: `SELECT customer_id, customer_name FROM Customers c WHERE EXISTS (SELECT 1 FROM Orders o WHERE o.customer_id = c.customer_id);` Here the subquery `SELECT 1 FROM Orders o WHERE o.customer_id = c.customer_id` will attempt to find an order for the current customer `c`. We use `SELECT 1` because we don't actually need any column – we just want to know if a row exists. If an order exists, the subquery returns a row (or many, but EXISTS doesn't care beyond the first), making `EXISTS (...)` true, so that customer will be included. If no matching order is found, the subquery returns nothing and `EXISTS` is false, excluding that customer.
Notes:
- `EXISTS` is typically efficient because the database can often perform a semi-join, stopping the subquery search as soon as a match is found. This can be faster than a JOIN if you only care about existence and not the actual data from the other table.
- If you want to ensure something does not exist, you can use `NOT EXISTS` similarly, which is true when the subquery returns no rows. For example, to find customers with no orders, you'd use `WHERE NOT EXISTS (SELECT 1 FROM Orders o WHERE o.customer_id = c.customer_id)`.
- From a readability perspective, `EXISTS` explicitly conveys "I am checking for existence of related records," which can make complex queries easier to understand compared to alternative patterns.
- What is a temporary table in SQL, and when would you use one?
Answer: A temporary table is a table that is created and exists temporarily on the database server, typically for the duration of a user session or a transaction. Temporary tables are often used to store intermediate results that you need to use across multiple steps of a complex operation, without permanently storing that data in the main schema. They are isolated from other sessions, meaning one user's temporary table won't conflict with another's.
Characteristics:
- In many SQL systems (like SQL Server, PostgreSQL), you create a temporary table by prefixing the table name with a symbol (e.g., `#TempTable` for a local temp table in SQL Server) or by using a specific syntax (`CREATE TEMPORARY TABLE ...` in MySQL).
- Temporary tables are automatically dropped when your session ends (for session-scoped temp tables) or when the transaction ends (for transaction-scoped, if supported). You can also manually drop them earlier if needed.
- The schema of a temporary table can be defined just like a regular table, including primary keys, indexes, etc., but it resides either in a temporary schema or in memory depending on the database.
When to use:
- Complex Queries: If you have a complex analytical query that's easier to solve in steps, you might create a temp table to hold the result of step 1, then join or filter on it in step 2. This can sometimes make the logic clearer or even improve performance by breaking a query into smaller pieces.
- Performance Tuning: Sometimes a single massive query with many subqueries or joins becomes slow or difficult for the optimizer. Storing intermediate results in a temp table (which can be indexed, etc.) might help. It also allows you to examine the intermediate data for debugging.
- Avoiding Redundant Calculations: If you need to use the same complex result set in multiple places within a procedure or batch, computing it once into a temporary table and then reusing it can be more efficient than re-computing it via subqueries each time.
- Batch Processing: Temporary tables can be used in stored procedures to accumulate results from loop iterations or to batch insert/update operations.
Example: `CREATE TEMPORARY TABLE temp_sales AS SELECT customer_id, SUM(amount) AS total_amount FROM sales WHERE sale_date >= '2025-01-01' GROUP BY customer_id;` This MySQL example creates a temp table `temp_sales` with total sales per customer for the year 2025. You could then index `temp_sales` or join it with other tables for further analysis, and it will automatically disappear at the end of your session.
In summary, use a temporary table when you need to break down tasks or hold intermediate data without requiring permanent storage. They are a handy tool for scripting and complex procedures in SQL.
In summary, use a temporary table when you need to break down tasks or hold intermediate data without requiring permanent storage. They are a handy tool for scripting and complex procedures in SQL. - In many SQL systems (like SQL Server, PostgreSQL), you create a temporary table by prefixing the table name with a symbol (e.g.,
- How do you use the `ALTER TABLE` statement? Give an example.
Answer: The `ALTER TABLE` statement is used to modify an existing table's structure. With `ALTER TABLE`, you can add, drop, or rename columns, change a column's data type, add or drop constraints (like primary keys, foreign keys, indexes), or adjust table options. The exact syntax can vary slightly by database, but generally: `ALTER TABLE table_name ADD COLUMN new_column datatype [constraints];` (to add a new column), or `ALTER TABLE table_name DROP COLUMN column_name;` (to remove a column), etc.
Examples:
- Adding a column: If we have a `sales` table and we want to add a new column for discount, we could do: `ALTER TABLE sales ADD COLUMN discount DECIMAL(10,2);` This adds a new column named `discount` with a decimal data type (for monetary values) to the `sales` table. By default, the new column would contain NULLs for existing rows (unless a `DEFAULT` value is specified or the column is defined as NOT NULL without a default, in which case the operation might fail until a default is provided).
- Modifying a column: To change the data type or length of a column, the syntax might be: `ALTER TABLE sales ALTER COLUMN discount TYPE DECIMAL(12,2);` (PostgreSQL syntax to widen a decimal, for example), or in SQL Server: `ALTER TABLE sales ALTER COLUMN discount DECIMAL(12,2);` to change the definition of the `discount` column.
- Dropping a column: `ALTER TABLE sales DROP COLUMN discount;` would remove the `discount` column from the table (and all data in that column).
- Renaming a column: Some dialects use `ALTER TABLE ... RENAME COLUMN old_name TO new_name;`. Others, like SQL Server, require a special stored procedure (`sp_rename`). MySQL uses `ALTER TABLE ... CHANGE old_name new_name datatype`.
- Adding/dropping constraints: e.g., to add a new primary key or foreign key after table creation, or to drop an existing constraint, you'd use clauses like `ADD CONSTRAINT ... PRIMARY KEY (...)` or `DROP CONSTRAINT ...`.
Use with caution: Altering large tables (especially adding columns with default values, or dropping columns) can be expensive, as it may rewrite the whole table. Some alterations can be done online or with minimal locking depending on the database. It's always good to plan schema changes carefully in a production environment.
In summary, `ALTER TABLE` is the go-to command for schema evolution of a table: adding new attributes, modifying definitions, or cleaning up unused ones.
- What is normalization in databases, and why is it important?
Answer: Normalization is the process of structuring a relational database according to a series of normal forms (rules) to reduce data redundancy and improve data integrity. The goal is to ensure that each piece of data is stored once, in the appropriate place, and that dependencies are logical, thereby minimizing anomalies when inserting, updating, or deleting data.
There are several normal forms (1NF, 2NF, 3NF, BCNF, etc.), but generally:
- 1st Normal Form (1NF): Ensure each column holds atomic (indivisible) values and each row-column intersection has a single value (no repeating groups or arrays in a single column).
- 2nd Normal Form (2NF): Achieved when 1NF is met and every non-key attribute is fully functionally dependent on the entire primary key (this mainly applies when you have a composite primary key; it addresses partial dependency).
- 3rd Normal Form (3NF): Achieved when 2NF is met and all the columns are dependent only on the primary key, not on other non-key columns (i.e., no transitive dependencies). In simpler terms, no column should depend on another column that isn't the primary key.
- (Higher forms like BCNF and 4NF deal with more complex scenarios like overlapping candidate keys or multi-valued dependencies, which are beyond typical interview scope unless you dive deep.)
Importance: Normalization is important because it:
- Eliminates Redundant Data: By organizing data into tables such that each fact is stored once, it avoids anomalies. For instance, if a customer's address is stored in only one table, updating that address in one place updates it for all orders, avoiding inconsistent data.
- Improves Data Integrity: Since each piece of data lives in one place, the risk of inconsistencies (two records disagreeing on the same fact) is minimized. Constraints like foreign keys can enforce relationships (referential integrity), and having a normalized schema often means you can rely on those constraints.
- Makes Maintenance Easier: Changes to data (insertion, updates, deletions) require touching fewer tables/rows, and the rules governing the data are clearer. For example, deleting a product category in a well-normalized schema might be straightforward if categories are in their own table and linked properly, whereas in a denormalized schema that might require hunting down multiple tables/fields.
- Query Flexibility: Normalized designs often translate into smaller, more focused tables that can be joined in various ways to answer different questions. If everything were lumped together (denormalized without careful planning), certain queries might be harder or require scanning a lot of redundant data.
Example: A classic example of normalization is separating customer info and order info. Instead of one big table with order and customer details (where a customer's name and contact info repeat for every order they made), you would have a `Customers` table and an `Orders` table with a foreign key linking them. This way, the customer's data is stored once in `Customers`. This prevents anomalies like one order having an outdated address while another order for the same customer has a new address – because there's only one place to update the address.
In short, normalization is about being systematic and efficient in how data is stored, which leads to reliable and consistent databases. (Keep in mind that in some cases, denormalization might be introduced later for performance, but normalization is the baseline for sound database design.)
- What is denormalization, and when might you use it?
Answer: Denormalization is the process of intentionally introducing some redundancy into a normalized database design – effectively the opposite of normalization. This is done to improve read performance or simplify query logic at the expense of potential write inefficiencies and data redundancy. In denormalization, you might combine tables, duplicate data, or store aggregated values that would otherwise be computed via joins or subqueries on a normalized schema.
When to use denormalization:
- Performance Reasons: If a database has become a bottleneck for read queries due to many complex joins, denormalizing can reduce the number of joins needed. For instance, a reporting system might maintain a summary table that duplicates some data from multiple normalized tables so that a single SELECT can fetch all needed information, rather than performing multiple joins and calculations each time.
- Simpler Queries: In systems like data warehousing or OLAP (analytical processing), denormalized schemas (like star schemas) are used. A star schema has a big fact table and denormalized dimension tables (with data that might be duplicated rather than further normalized into sub-dimensions). This makes it easier for tools and analysts to write straightforward queries (often just one join between fact and dimension per dimension) and is optimized for read-heavy, aggregate-heavy access patterns.
- Caching and Precomputation: Sometimes denormalization occurs by adding extra columns that hold precomputed values (e.g., storing a subtotal or a record count that could be derived via a query). This speeds up retrieval at the cost of having to maintain those values on write.
- Distributed Systems (NoSQL stores): In some NoSQL or big-data systems without join support, one might denormalize data (embed one record inside another) to get all info in one read. For example, storing an array of orders inside a customer document in a document database denormalizes the customer/orders relationship to optimize retrieval of a customer with all their orders in one go.
Trade-offs:
- Redundancy and Anomalies: Once data is copied in multiple places, there is a risk that one copy gets updated and others do not, leading to inconsistencies. It requires careful design and often additional application or database logic (like triggers) to keep redundant data in sync.
- Storage Cost: More data is stored because of duplication. In modern systems this is often a lesser concern than it used to be, but it's still a consideration.
- Maintainability: The schema is less clean conceptually. Developers/DBAs need to remember to update multiple places or maintain extra logic. Backfilling denormalized data after the fact can also be complex.
Example: Suppose in a normalized design you have `Orders` and `Products` tables, and a join table `OrderItems`. If you need to frequently query the total revenue per order or the order with all product names, a denormalized table might store each order with a concatenated list of product names and the total price. This avoids joining `OrderItems` and `Products` to compile that info each time. The trade-off is whenever an order item changes or a product name changes, you have to update that denormalized info too (a sketch follows this answer).
In summary, denormalization is used pragmatically: you start with a normalized design (for integrity), then selectively denormalize certain parts of the data model to gain performance or simplicity for read-heavy scenarios. It should be done only when necessary, with safeguards to handle the redundancy correctly.
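A hedged sketch of the summary-table idea from the example above. The names and columns are hypothetical, `CREATE TABLE ... AS` syntax varies by database, and the table must be refreshed whenever the source rows change:

```sql
-- Precomputed per-order summary, trading write-time maintenance for read speed.
CREATE TABLE order_summary AS
SELECT o.order_id,
       SUM(oi.quantity * oi.unit_price) AS total_price,
       COUNT(*)                         AS item_count
FROM Orders o
JOIN OrderItems oi ON oi.order_id = o.order_id
GROUP BY o.order_id;
```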
- How can you insert data in bulk into a SQL table?
Answer: There are several ways to perform a bulk insert of data into a SQL table, depending on the context and the database platform. Key approaches include:
- Multiple VALUES in a single INSERT: Many databases allow inserting multiple rows with one `INSERT` statement by listing multiple sets of values. For example: `INSERT INTO sales (product_name, amount) VALUES ('Laptop', 1200), ('Smartphone', 800), ('Tablet', 600);` This single statement inserts three rows into the `sales` table. This approach is good for a moderate number of rows and is straightforward for scripts. (Some databases have an upper limit on how many rows you can insert this way in one statement, but it's typically quite high or limited by packet size.)
- Bulk load utilities: Most DBMSs have special bulk import tools or SQL commands for loading data from external files efficiently:
  - In MySQL, you can use `LOAD DATA INFILE 'file_path.csv' INTO TABLE table_name ...` to load directly from a CSV or TSV file. This is much faster than row-by-row inserts for large data.
  - In PostgreSQL, the `COPY` command (and its psql wrapper `\copy`) can bulk load from a file or STDIN.
  - SQL Server has the `BULK INSERT` T-SQL command, as well as bcp (bulk copy program) and SSIS for bulk operations.
  - Oracle has SQL*Loader for high-speed loading, and external table features.
  These utilities are optimized for bulk ingestion, often bypassing some of the overhead that individual INSERTs have (like logging each row).
- INSERT ... SELECT: If the data already exists in another table, you can bulk-copy it in one statement. For example, to copy rows from `old_sales` to `new_sales`: `INSERT INTO new_sales (product_name, amount) SELECT product_name, amount FROM old_sales;` This will bulk-insert all rows from `old_sales` into `new_sales` in one shot (which could be thousands or millions of rows) without bringing data to the client side.
- Transactions: When doing bulk inserts, it's usually faster to wrap them in a single transaction (or in reasonably sized transaction batches) rather than committing each row (see the sketch at the end of this answer). This way the database can optimize commits and logging. However, extremely large transactions can also bloat logs and use a lot of memory, so sometimes bulk inserts are done in chunks (like 10k rows per commit, depending on the system).
Example using MySQL bulk load: `LOAD DATA INFILE '/path/to/data.csv' INTO TABLE sales FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' (product_name, amount);` This would quickly import all rows from a CSV file into the `sales` table.
In summary, to insert in bulk you either combine many rows into one SQL command or use database-specific bulk import tools, both of which minimize overhead compared to inserting rows one-by-one.
In summary, to insert in bulk you either combine many rows into one SQL command or use database-specific bulk import tools, both of which minimize overhead compared to inserting rows one-by-one. - Multiple VALUES in a single INSERT: Many databases allow inserting multiple rows with one
- What is an index in SQL, and why would you use one?
Answer: An index in SQL is a data structure that improves the speed of data retrieval on a table at the cost of additional writes and storage. You can think of an index like an index in a book: instead of flipping through every page to find a topic, you jump to the index which quickly directs you to the pages containing that topic. In a database, an index does something similar by allowing the database engine to quickly locate rows based on the indexed column(s).
How it works:- Most relational database indexes are implemented as B-trees (or variants like B+ trees). There are also other types (hash indexes, inverted indexes for full-text, etc.), but B-tree is common for general-purpose indexing.
- When you create an index on a column (say,
CREATE INDEX idx_name ON Customers(last_name)
), the database builds a sorted structure (B-tree) of the last names and pointers to the full table rows (or to an underlying row ID). - This sorted structure allows efficient binary search-like lookups, typically resulting in logarithmic time complexity rather than linear. So finding all customers with last name = “Smith” can be done very quickly by traversing the tree, instead of scanning every row in the
Customers
table.
Why use one: - Performance for SELECTs: The primary reason is to speed up queries that filter on certain columns or join on certain columns. If you often query a table for
WHERE column = someValue
orWHERE column BETWEEN X and Y
or do a lot ofORDER BY column
, an index on that column can drastically reduce query time because the database can locate the relevant data without examining each row. - Sorting and Grouping: Indexes can also help with sorting (
ORDER BY
) and grouping (GROUP BY
) because the data is already sorted in index order. The database might read directly from the index if it covers the required columns. - Enforcing Uniqueness: A unique index (like what’s automatically created for a primary key or a unique constraint) ensures no duplicate values are entered in that column and also provides fast lookups.
Trade-offs:
- Slower Writes: Indexes need to be updated whenever the data in the indexed column changes or when new rows are inserted/deleted. This adds overhead to `INSERT`, `UPDATE`, and `DELETE` operations. For example, inserting a row might mean not only adding to the table but also inserting a key into each index’s B-tree.
- Extra Storage: Indexes take up disk space (and memory when in use). A table with many indexes will occupy more storage and might have larger database backups, etc.
- Memory Usage: More indexes means the database might use more memory to cache index pages.
- Diminishing returns: An index helps specific queries. If a query doesn’t use that column (or combination of columns), the index is irrelevant. Too many indexes can actually degrade overall performance if not all are needed, due to the write overhead.
- Choosing indexes: Typically you index columns that are frequently used in WHERE conditions, join conditions, or used in ordering/grouping, especially if the table is large. Columns with high selectivity (many distinct values) tend to benefit more. There’s also the concept of composite indexes (multi-column indexes) for queries that filter on multiple columns together.
Example: Imagine a `Customers` table with millions of rows and a query `SELECT * FROM Customers WHERE email = 'john@example.com';`. Without an index on `email`, the database does a full table scan (checks every row). With an index on `email`, it can very quickly find the specific entry for 'john@example.com' in the index and jump to the row, which is much faster.
In summary, an index is a performance-tuning feature that helps queries run faster by reducing the amount of data the database needs to scan. You would use one whenever you identify a query pattern that is slow due to scanning and can be improved by indexing certain columns.
- How do you create an index on a table? Give an example.
Answer: You create an index using the `CREATE INDEX` statement (outside of any SELECT/INSERT/UPDATE; it’s a DDL command, like creating a table). The basic syntax is:
`CREATE INDEX index_name ON table_name (column1, column2, ...);`
You specify the table and the column(s) you want to index. You also give the index a name (which should be unique among indexes in that schema). Optionally, you can add keywords like `UNIQUE` (to enforce uniqueness) or specify the index type (like `USING BTREE` in MySQL, or a `CLUSTERED`/`NONCLUSTERED` index in SQL Server, etc., depending on the DBMS).
Example: If we have a `sales` table and we often query by `product_name`, we might create an index on that column:
`CREATE INDEX idx_product_name ON sales(product_name);`
This will create an index named `idx_product_name` on the `sales` table’s `product_name` column. After creating this index, queries such as `SELECT * FROM sales WHERE product_name = 'Laptop';` will likely run faster because the database can use the index to find all rows with product_name = 'Laptop' quickly.
Composite Index Example: Suppose we often query `Orders` by `customer_id` and, within that, by `order_date` (e.g., find orders for customer X in a date range). We might create a composite index:
`CREATE INDEX idx_orders_customer_date ON Orders(customer_id, order_date);`
This index is sorted first by `customer_id`, and then by `order_date` within each customer. It can efficiently support queries that filter by customer_id and date, or by customer_id alone (as a leftmost prefix of the index).
Unique Index Example: If we want to ensure emails are unique in a `Customers` table and also index them, we might do:
`CREATE UNIQUE INDEX idx_customers_email ON Customers(email);`
This not only creates an index on `email`, but also enforces that no two customers can have the same email (a unique constraint).
When creating indexes: Keep in mind that creating an index on a very large table can take time and temporarily use a lot of CPU/IO. It’s often done during a maintenance window for huge tables. Some databases lock the table or slow down modifications during index creation (others have online index builds). Also, after creation, the database’s query optimizer will automatically decide when to use the index based on query patterns and cost estimates; you don’t “use” an index manually in the query (beyond occasionally giving hints if necessary).
In short, use `CREATE INDEX` followed by an appropriate name, table, and column list to add an index to a table for faster querying.
- What techniques can you use to optimize SQL query performance?
Answer: Optimizing SQL query performance can involve several strategies, ranging from rewriting the query to adjusting the database schema or configuration. Here are some common techniques:
- Indexes: Ensure appropriate indexes exist for the query’s filtering/join conditions. Without indexes, the database often has to perform full table scans. For example, if a query frequently filters on `WHERE status = 'ACTIVE'`, an index on the `status` column can help. Likewise, indexing join columns can drastically speed up join performance. However, avoid over-indexing (too many indexes), as that can slow down writes and bloat the database.
- Query Refactoring: Rewrite inefficient queries. For instance, selecting only needed columns instead of `SELECT *` can reduce I/O and memory usage (especially if some columns are large blobs or text). Removing unnecessary complex expressions or subqueries helps. Sometimes one complex query can be broken into multiple simpler queries with intermediate temp tables (if the single query plan is not optimal). Conversely, multiple queries can sometimes be combined to avoid repeated work.
- Avoiding or Reducing Subqueries: In some cases, especially in older database versions, a subquery might be re-run for each row (correlated subqueries) or might not be optimized into a join. Converting subqueries to JOINs or using `WITH` (common table expressions) can help the optimizer. Modern engines often optimize correlated subqueries well, but it’s something to consider if performance is an issue.
- Efficient Joins: Ensure that joins are done on columns that are indexed, or at least of the same type and size. Use appropriate join types; for example, if you only need matching rows from both tables, use INNER JOIN rather than OUTER JOIN. Also, joining on highly selective conditions first (when controlling join order via hints or query structure) can reduce intermediate row counts. In some cases, splitting very large join operations into smaller chunks (perhaps by filtering on date ranges) can be practical if the full join is too heavy.
- Use of EXISTS vs IN vs JOIN: For existence checks, `EXISTS` is sometimes more efficient than an `IN` clause or an equality join, because the database can stop at the first match. For example, replacing `WHERE id IN (SELECT ...)` with `WHERE EXISTS (SELECT 1 ...)` might improve performance. Conversely, if subqueries return a lot of data, a join might be better. It’s good to understand how your specific DBMS optimizes these.
- Limit the Data Scanned: Use `WHERE` clauses to filter data as early as possible. Avoid computations on columns in the WHERE clause that prevent index usage (e.g., `WHERE YEAR(date) = 2025` can’t use an index on `date`; better is `WHERE date >= '2025-01-01' AND date < '2026-01-01'`). Similarly, avoid leading wildcards in text searches (`LIKE '%Smith'` can’t use an index efficiently, whereas `LIKE 'Smith%'` can).
- Proper Use of Wildcards and Full-text Indexes: If doing a lot of text searching, consider full-text indexes rather than `LIKE` with wildcards on large text fields.
- Denormalization or Caching Results: In some cases, especially for complex aggregations or reports, creating summary tables or denormalized structures can serve queries faster than hitting normalized transactional tables with lots of joins and calculations each time. Materialized views (in databases that support them) or scheduled jobs that pre-compute heavy results can make queries essentially just a quick select from a precomputed table.
- Pagination and Large Results: If only the first N rows are needed, use `LIMIT`/`TOP` to avoid heavy work on all rows. When doing pagination, using an indexed WHERE clause to seek to your starting point (keyset pagination) can be much more efficient than a large OFFSET; see the sketch after this list.
- Updating Statistics and Query Tuning: Ensure the database’s statistics on table data are up to date, as the query optimizer relies on them for good decisions. In some databases, you can also inspect the query execution plan (using `EXPLAIN` or similar), identify bottlenecks (like a full table scan or a suboptimal join strategy), and then add hints or indexes as needed.
- Hardware and Configuration: At a higher level, ensure the server has enough memory so that indexes and frequently accessed data can be cached, and consider partitioning large tables or using read replicas if the workload is heavy.
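Here is a small sketch of the pagination point, assuming a hypothetical `Orders` table whose `order_id` column is indexed (MySQL/PostgreSQL-style syntax):

```sql
-- OFFSET pagination: the database still produces and discards 100000 rows
SELECT order_id, order_date
FROM Orders
ORDER BY order_id
LIMIT 20 OFFSET 100000;

-- Keyset pagination: remember the last order_id from the previous page
-- (assumed here to be 100000) and seek directly past it via the index
SELECT order_id, order_date
FROM Orders
WHERE order_id > 100000
ORDER BY order_id
LIMIT 20;
```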
Example: If you have a slow query `SELECT * FROM Orders WHERE customer_id = 123 AND order_date > '2025-01-01'` and you find it’s doing a table scan, consider an index on `(customer_id, order_date)`, which allows the database to directly locate orders for customer 123 in the date range instead of scanning all orders. Also, if `Orders` is huge and historical, partitioning by year might help so that the query only looks at recent partitions.
In summary, optimizing SQL queries often involves creating the right indexes, rewriting the query to be sargable (searchable) and lean, and understanding the execution plan to address bottlenecks. Performance tuning is part art, part science, and heavily dependent on the specific database and query patterns.
- How do you analyze a query’s execution plan or performance in SQL?
Answer: To analyze how a database executes a query (the execution plan), most SQL databases provide an `EXPLAIN` or `EXPLAIN PLAN` command (or, in tools, a visual query plan) that details the steps the optimizer will take. Examining an execution plan helps identify performance issues like full table scans, inefficient join algorithms, or missing index usage. Here’s how you do it and what to look for:
- Using EXPLAIN: In many databases, you prefix your query with `EXPLAIN`. For example, `EXPLAIN SELECT * FROM sales WHERE amount > 1000;`. The output is not the data, but a summary of each step in the query execution: which indexes are used or not used, join types (nested loop, hash join, merge join, etc.), estimated number of rows processed at each step, and so on. Different DBMSs have different formats (MySQL shows id, select_type, key, possible_keys, etc.; PostgreSQL gives a tree of operations with cost estimates; SQL Server has a graphical plan or XML).
- Looking for table scans vs index usage: If the plan says something like “Seq Scan” (sequential scan) or “Table Scan” on a large table, that means no suitable index was used for a filter condition. Perhaps adding an index could help, or maybe the query filters a very large portion of the table anyway (in which case an index might not help much). If the plan shows “Index Scan” or “Index Seek”, that indicates it’s using an index to find data. Generally, an “Index Seek” on a highly selective condition is very good (few rows examined).
- Join algorithms: The plan will show how tables are joined: Nested Loop (good for small to medium sets, especially with index on inner side), Hash Join (good for large, unsorted data sets, but uses memory), Merge Join (good when both inputs are sorted on the join key). If a nested loop is iterating over a huge number of rows without an index, that’s a red flag. If a hash join spills to disk because of insufficient memory, that’s a performance issue.
- Row estimates vs actual: Some tools (like Postgres’s EXPLAIN ANALYZE or SQL Server’s actual execution plan) let you run the query and see actual counts and times, not just estimates. Large discrepancies between estimated rows and actual rows can indicate outdated statistics or a case where the optimizer misestimated, leading to a suboptimal plan.
- Identify the slowest step: In a complex plan, one part usually dominates the cost: a particular join, a sort operation, or an aggregation. For example, an expensive sort might be avoided by an index on the ORDER BY column, and an expensive aggregation might be better served by a pre-aggregated table.
- Profiling using tools: Many databases also have profiling tools or can show resource usage for a query. For instance, in MySQL you can run `SHOW PROFILES` after enabling profiling to see time spent. In SQL Server, you might use SET STATISTICS IO and TIME to see I/O and time metrics, or use the Query Analyzer. In Oracle, you have the SQL Trace and TKPROF utilities, as well as V$SQL view statistics.
- Execution plan example: Suppose we run `EXPLAIN SELECT * FROM Orders o JOIN Customers c ON o.customer_id = c.id WHERE c.country = 'US';`. The plan might show:
  - a sequential scan on Customers with a filter country = 'US',
  - then a nested loop joining to Orders using an index on Orders(customer_id).
If the Customers table is large and country is not indexed, we see a seq scan, which suggests adding an index on country could avoid scanning all customers. If country = 'US' yields 50% of customers, an index isn’t very selective (so the scan might be okay). But if it yields 5% of customers, an index would likely help.
- Use of EXPLAIN in practice: You typically get the plan, identify spots like “why is it doing a table scan here?” or “why is the cost so high for this join?”, then adjust the query or indexes and test again. It’s a bit of an iterative process.
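As a concrete illustration, here is what the two variants look like in PostgreSQL syntax, reusing the hypothetical `Orders`/`Customers` query above (plan output formats differ across databases):

```sql
-- Show the optimizer's estimated plan without executing the query
EXPLAIN
SELECT *
FROM Orders o
JOIN Customers c ON o.customer_id = c.id
WHERE c.country = 'US';

-- Execute the query and report actual row counts and timings next to estimates
EXPLAIN ANALYZE
SELECT *
FROM Orders o
JOIN Customers c ON o.customer_id = c.id
WHERE c.country = 'US';
```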
In summary, analyzing an execution plan is done via database-specific explain tools. You interpret the plan to see how the query is executed step-by-step, then use that insight to guide performance tuning (adding indexes, rewriting the query, updating statistics, etc.). It’s one of the most powerful techniques to figure out why a query is slow and what can be done about it.
- What does the `COALESCE()` function do in SQL?
Answer: `COALESCE()` is a built-in SQL function that returns the first non-NULL value in a list of expressions. It is often used to substitute default values for NULLs or to handle optional fields. The general syntax is `COALESCE(expr1, expr2, ..., exprN)`. It will evaluate each expression in order and return the first one that is not NULL; if all expressions are NULL, `COALESCE` itself returns NULL.
Key points:
- `COALESCE(x, y)` is essentially shorthand for `CASE WHEN x IS NOT NULL THEN x ELSE y END`: if x is not null, use x; otherwise use y. With more arguments, it cascades that logic.
- All expressions in the list should be of compatible types, since the result type of `COALESCE` is determined by the arguments (generally it takes the data type of the highest-precedence argument among them, following type conversion rules).
Common uses:
- Replacing NULL with a default: For example, if you have a `discount` column that is NULL when no discount was offered, you might query `COALESCE(discount, 0)` to treat NULL as a 0% discount. This way calculations are easier (NULL can be problematic in arithmetic).
- Choosing among multiple nullable columns: Suppose you have a table where you can store a phone number in either a `home_phone` or `cell_phone` column (and one of them may be NULL). You could do `COALESCE(home_phone, cell_phone, 'No Phone')` to get whichever phone is available, or a placeholder if none.
- Pivoting data: Sometimes used in pivot queries or when concatenating strings where some parts might be NULL.
- Avoiding NULL in aggregates: For instance, `SUM(COALESCE(amount, 0))` will treat NULL amounts as 0 in the sum. (Though aggregate functions typically ignore NULLs anyway, this is more relevant for cases like the difference between `COUNT(COALESCE(col, 0))` and `COUNT(col)`, or for concatenation.)
Example 1:
`SELECT order_id, COALESCE(shipped_date, 'Pending') AS status FROM Orders;`
If `shipped_date` is NULL (order not shipped yet), this will output “Pending” for the status instead of a NULL. If `shipped_date` has a value, it will show that value (or you might format it accordingly). Note that because `shipped_date` is a date and 'Pending' is text, stricter databases may require casting the date to text so the argument types are compatible.
Example 2:
`-- Assume a schema where a person can have multiple nicknames, but we want the first non-null. SELECT person_id, COALESCE(nickname1, nickname2, nickname3, name) AS display_name FROM Persons;`
This tries nickname1; if that’s NULL, then nickname2; if that’s NULL, then nickname3; and if that’s also NULL, it finally falls back to the actual name. Essentially: “pick the first available nickname, otherwise use the real name.”
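For a quick sanity check without any tables, databases that allow `SELECT` without a `FROM` clause (e.g., PostgreSQL, MySQL, SQL Server) let you try `COALESCE` directly; in Oracle you would add `FROM dual`:

```sql
SELECT COALESCE(NULL, NULL, 'fallback');  -- returns 'fallback'
SELECT COALESCE('first', 'second');       -- returns 'first'
SELECT COALESCE(NULL, 42, 7);             -- returns 42 (the first non-NULL)
```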
Note: `COALESCE` is part of the ANSI SQL standard and is supported in most databases. Some databases also have similar functions, like `NVL(expr1, expr2)` in Oracle (which is just a two-argument coalesce) or `ISNULL(expr1, expr2)` in SQL Server. But `COALESCE` is portable and handles multiple arguments.
In summary, `COALESCE` helps in dealing with NULLs by providing fallback values, making query results more informative and avoiding NULLs in output where they are not desired.
- How can you calculate the percentage of total sales contributed by each product?
Answer: To calculate a percentage of a total in SQL, you need to divide the part by the whole and typically multiply by 100. There are a couple of ways to do this in a query: using a subquery or using a window function.
Using a subquery approach:
- First, find the total sales across all products.
- Then, for each product’s sales, divide by that total.
For example:
`SELECT product_name, SUM(amount) AS product_total, SUM(amount) * 100.0 / total.all_sales AS percentage_of_total FROM sales, (SELECT SUM(amount) AS all_sales FROM sales) AS total GROUP BY product_name, total.all_sales;`
In this query, the subquery `(SELECT SUM(amount) AS all_sales FROM sales)` produces the overall total sales (let’s call it X). The main query groups by product and computes `SUM(amount)` per product. Dividing `SUM(amount)` by X and multiplying by 100 gives the percentage contribution of that product. We multiply by 100.0 (note the decimal, to ensure a fractional result and avoid integer division) to express it as a percentage. The result might look like: Product A – 12.5%, Product B – 7.3%, etc., summing up to 100%.
Using a window function approach:
You can use a window function to compute the grand total without a separate subquery:
`SELECT product_name, SUM(amount) AS product_total, SUM(amount) * 100.0 / SUM(SUM(amount)) OVER () AS percentage_of_total FROM sales GROUP BY product_name;`
Here, `SUM(SUM(amount)) OVER ()` is a window function applied over the grouped result: since no partition is specified, it sums the per-product totals across the entire result set, giving the grand total X on every row. (The nested `SUM(SUM(...))` is needed because the window runs after the GROUP BY; a bare `SUM(amount) OVER ()` would be invalid in a grouped query.) So each product’s `SUM(amount)` is divided by that overall total (and multiplied by 100) to yield the percentage. The advantage is you don’t need to write a separate subquery for the total. Not all SQL dialects allow combining grouping and window functions in the same query this way (some require the window in a subquery or a CTE because of processing order), but many do via a CTE:
`WITH product_totals AS ( SELECT product_name, SUM(amount) AS product_total FROM sales GROUP BY product_name ) SELECT product_name, product_total, product_total * 100.0 / SUM(product_total) OVER () AS percentage_of_total FROM product_totals;`
This CTE approach first gets totals per product, then uses a window to compute the overall total across those totals.
Precision and formatting: The result of the division might be a fractional number. You can format it to, say, two decimal places, or add a ‘%’ sign in the presentation layer. Also consider using decimal or float types appropriately; using `100.0` ensures floating-point or high-precision numeric calculation instead of integer division.
Example numbers: If total sales is $50,000 and Product A has $5,000, then percentage = 5000/50000 * 100 = 10%. Our query would output 10.0 for product A’s percentage (which you might interpret as 10%). Summing the percentage_of_total for all products should yield 100 (give or take any rounding differences).
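A self-contained sketch verifying that arithmetic (a hypothetical throwaway `sales_demo` table, PostgreSQL-style syntax):

```sql
CREATE TABLE sales_demo (product_name TEXT, amount NUMERIC);
INSERT INTO sales_demo VALUES ('Product A', 5000), ('Product B', 45000);

SELECT product_name,
       SUM(amount) AS product_total,
       SUM(amount) * 100.0 / SUM(SUM(amount)) OVER () AS percentage_of_total
FROM sales_demo
GROUP BY product_name;
-- Product A -> 10.0, Product B -> 90.0 (total sales = 50,000)
```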
So, the straightforward concept is: Percentage = (Part / Whole) * 100. In SQL, use either a subquery or a window function to obtain the whole (the total) within the same query, then compute each part’s share of it.
- What is a view in SQL, and how do you create one?
Answer: A VIEW is a virtual table in SQL, defined by a query. It doesn’t store data physically (except in the case of materialized views), but rather provides a saved SELECT statement that you can query as if it were a table. Views are used to simplify complex queries, to provide a layer of abstraction or security (restricting users to certain subsets of data), and to present data in a certain format.
Creating a view: Use the `CREATE VIEW` statement. For example:
`CREATE VIEW high_value_sales AS SELECT order_id, customer_id, amount, sale_date FROM sales WHERE amount > 1000;`
This creates a view named `high_value_sales`. Now you can treat `high_value_sales` like a table in queries:
`SELECT customer_id, SUM(amount) FROM high_value_sales GROUP BY customer_id;`
When you query the view, the database essentially runs the underlying SELECT (`SELECT order_id, ... FROM sales WHERE amount > 1000`) and then applies whatever outer query you wrote.
Benefits of views:
- Simplicity: Encapsulate a complex join or calculation. For instance, if you often join multiple tables, a view can hide that complexity, and end-users or reports can just `SELECT` from the view directly.
- Consistency: Ensure that certain business logic (like a filter or a calculation) is applied uniformly. E.g., a view could always filter out soft-deleted records, so any query on the view automatically omits those, rather than every developer remembering to put `WHERE deleted_flag = 0`.
- Security: Limit access to sensitive data. You can grant users permission to select from a view but not from the underlying table, so they see only the columns/rows exposed by the view. For example, a view might exclude salary information or anonymize data.
- Aggregation or formatting: A view could present data aggregated (like a summary view) or with specific formatting (like concatenating names, etc.), so that higher-level consumers don’t have to redo that logic.
Updating through views: If the view is simple (like a direct select from a table or straightforward join), some DBMSs allow insert/update/delete on the view which propagate to underlying tables. But not all views are updatable (especially if they involve joins, groupings, or distinct). There are also materialized views (also called indexed views in SQL Server) which do store a copy of the data for performance, but those are a more advanced concept and need refreshing.
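For databases that support them, here is a minimal materialized-view sketch in PostgreSQL syntax (the `daily_sales` name is hypothetical):

```sql
-- Stores the query result physically; reads avoid re-running the aggregation
CREATE MATERIALIZED VIEW daily_sales AS
SELECT sale_date, SUM(amount) AS total_amount
FROM sales
GROUP BY sale_date;

-- The stored data goes stale as `sales` changes; refreshing re-runs the query
REFRESH MATERIALIZED VIEW daily_sales;
```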
Example scenario: Suppose you have a banking database with an `Accounts` table and a related `Transactions` table. You might create a view `CurrentBalance` that joins them and gives one row per account with the current balance calculated. Then any query or report that needs account balances can just select from the `CurrentBalance` view, instead of having each one join and sum transactions (which risks inconsistency if done differently in different places).
Creating a view is straightforward, as shown above; to drop a view, you use `DROP VIEW view_name;`.
In summary, a view is like a stored query that can be treated as a table, simplifying how you interact with complex or sensitive data.
- What is the difference between the `DELETE`, `TRUNCATE`, and `DROP` commands in SQL?
Answer: These three SQL commands are all used to remove data, but at different levels and with different scope and implications:
- DELETE: This is a Data Manipulation Language (DML) command that removes rows from a table based on an optional `WHERE` condition. You can delete specific records or all records (if you omit the WHERE). Each deletion is logged (for rollback purposes), and triggers on the table (if any) will fire for each row deleted. After a `DELETE`, the table structure and its indexes remain, as does any empty space left by the deleted rows (though it can be reclaimed for re-use). Importantly, you can `ROLLBACK` a DELETE if it’s in a transaction, to undo the deletions. For example, `DELETE FROM Employees WHERE department = 'Sales';` will remove all employees in the Sales department but leave the others. With no condition, `DELETE FROM Employees;` removes all rows (but the table is still there, just empty).
- TRUNCATE: This is a Data Definition Language (DDL) command (though it behaves somewhat like DML) that quickly removes all rows from a table. It is essentially a shortcut to wipe a table. It usually cannot be rolled back (unless your DB supports it in a transaction and it hasn’t been committed yet; in many systems it auto-commits). TRUNCATE is typically faster than DELETE for large tables because it doesn’t log each row deletion; instead it deallocates whole data pages or resets the storage. TRUNCATE also resets identity counters (if the table has an auto-increment column) in many implementations, whereas a DELETE would not (it keeps the last identity value). Also, triggers do not fire on TRUNCATE (since it’s DDL). Syntax: `TRUNCATE TABLE Employees;` – after this, the table is empty, but the structure remains. You cannot truncate a table that is referenced by foreign keys (to maintain referential integrity, you must disable or remove those references first).
- DROP: This is a DDL command that removes the schema object itself. `DROP TABLE Employees;` will completely delete the `Employees` table from the database – meaning its definition (column structure, constraints, indexes) and all its data are gone. After a drop, it’s as if the table never existed (except that you can recreate it). There is no rolling back a DROP (except in some databases if it’s within an explicit transaction that’s not committed – many databases auto-commit DDL statements). There are similar drop commands for other objects: `DROP VIEW`, `DROP INDEX`, etc. Dropping a table will drop all data and associated structures like indexes and constraints. You have to be careful, as this is a destructive, irreversible action in normal operation.
Summary of differences:
- Scope of removal: DELETE = some or all rows from a table; TRUNCATE = all rows from a table; DROP = the entire table (schema and data).
- Conditional: DELETE can have a WHERE clause to selectively remove rows; TRUNCATE and DROP do not have a WHERE – they’re all-or-nothing on their target.
- Transaction/Undo: DELETE operations can be rolled back (if not committed), since each deleted row is logged. TRUNCATE is usually not rollback-able because it’s DDL (though, e.g., in MS SQL Server it can be rolled back inside a transaction; in MySQL it cannot, because it causes an implicit commit). DROP is typically not rollback-able either (except in systems that support transactional DDL).
- Performance: TRUNCATE is faster than DELETE for bulk deletion of all rows, because it’s a minimal logging operation (drops and re-creates or resets the table storage). DROP is also fast and final. DELETE can be slow if deleting a large percentage of a table without any supporting indexes, as it logs each row and checks constraints for each deletion.
- Post-use state: After DELETE or TRUNCATE, the table exists (possibly empty). After DROP, the table does not exist anymore.
Example use cases:
- Use DELETE when you want to remove specific records (e.g., “delete order ID 1000”) or when you need to potentially undo the operation or have it logged for auditing.
- Use TRUNCATE when you need to quickly clear out an entire table (e.g., a staging table you’re done with, or resetting test data) and you don’t need to log individual row deletions. It’s essentially like “reset this table to empty”.
- Use DROP when you no longer need the table at all (e.g., dropping a temporary table or completely removing an obsolete table from the schema).
Keep in mind foreign key constraints: if other tables reference the data, you may need to adjust those (like delete children first or disable constraints) before a truncate or drop. And always be cautious with these commands, especially DROP, because of their irreversible nature in practice.
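A short sketch contrasting the three (as noted above, whether TRUNCATE/DROP can be rolled back varies by DBMS):

```sql
-- DELETE is transactional: it can be rolled back before commit
BEGIN;
DELETE FROM Employees WHERE department = 'Sales';
ROLLBACK;  -- the deleted rows are restored

TRUNCATE TABLE Employees;  -- empties the table; structure and indexes remain
DROP TABLE Employees;      -- removes the table definition and all its data
```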
- How can you find the median value of a numeric column in SQL?
Answer: Calculating a median in SQL (the middle value in a sorted list) is a bit more involved than common aggregates like AVG or SUM, because standard SQL doesn’t have a built-in median function. However, you can compute it using window functions or some clever querying. The approach can vary slightly by database, but here are a couple of approaches:
- Using Window Functions: One approach is to use `ROW_NUMBER()` (or `RANK()`) in conjunction with the total count of rows to pick the middle value(s). For example, using a common table expression (CTE) for clarity:
`WITH OrderedValues AS ( SELECT value_column, ROW_NUMBER() OVER (ORDER BY value_column) AS rn, COUNT(*) OVER () AS total_count FROM YourTable ) SELECT AVG(value_column) AS median FROM OrderedValues WHERE rn IN ((total_count + 1) / 2, (total_count + 2) / 2);`
This works by assigning a row number `rn` to each row when sorted by the value. `total_count` is the total number of rows. If `total_count` is odd, `(total_count + 1) / 2` and `(total_count + 2) / 2` evaluate to the same middle index (due to integer division), effectively picking the middle row. If `total_count` is even, these are the two middle indices, and the query takes the average of the two middle values. The result is the median. This method is fairly intuitive and works in many SQL dialects that support window functions and CTEs.
For example, if there are 7 rows, total_count=7, (7+1)/2 = 4 and (7+2)/2 = 4 (in integer math), so it picks rn = 4th row only. If there are 8 rows, total_count=8, (8+1)/2 = 4, (8+2)/2 = 5, so it averages rn = 4 and rn = 5. - Using Percentile Disc/Cont (if available): Some databases have special functions: for instance, PostgreSQL has
percentile_cont(0.5) WITHIN GROUP (ORDER BY value)
which gives the median (continuous 50th percentile). Oracle has similar analytic functions (PERCENTILE_CONT
andPERCENTILE_DISC
). If those are available, using them is simplest:SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value_column) AS median FROM YourTable;
This will directly compute the median (for even count, percentile_cont interpolates – giving the average of middle two; percentile_disc would pick one or the other discrete value). SQL Server introducedAPPROX_PERCENTILE_CONT
in newer versions for large data (approximate median). - Self-join approach (not as efficient): In older systems without window functions, one could do a join to count how many numbers are less than a given number and how many are greater, etc. For median: find a value where half are less or equal and half are greater or equal. But that’s quite inefficient (O(n^2) checks). Another approach is using OFFSET in some systems: e.g., in MySQL:
SELECT value FROM YourTable ORDER BY value LIMIT 1 OFFSET floor(count/2)
, but handling even vs odd via SQL can be tricky (you’d do two queries or union). - Using a temporary table or variables (if procedural allowed): In some environments, you might select into an ordered temp table and then pick the middle by indexing (some have limit/offset). But window function method above is set-based and more elegant.
Example: Suppose `sales.amount` values are [5, 20, 1, 7, 9] (5 rows). Sorted: [1, 5, 7, 9, 20]. The median is 7 (the third value). The window-function query would assign row numbers 1..5 with total_count = 5, and pick row (5+1)/2 = 3 and (5+2)/2 = 3 -> row 3, which is 7. If we added another value, say 11 (making 6 values: [1, 5, 7, 9, 11, 20], whose middle values are 7 and 9, average = 8), the formula picks rows (6+1)/2 = 3 and (6+2)/2 = 4 (the 3rd and 4th values, 7 and 9) and averages them to get 8.0.
Keep in mind performance: the window function method will sort the data (since ROW_NUMBER OVER ORDER BY requires sorting) – that’s O(n log n) which is typically fine for moderate result sets. For huge tables, computing median can be heavy. Some databases might optimize percentile_cont better.
In summary, to get a median in standard SQL, you often use window functions to locate the middle row(s), or vendor-specific percentile functions if available. There’s no single `MEDIAN()` function in ANSI SQL as of now.
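A self-contained sketch of the window-function method (hypothetical throwaway table, PostgreSQL-style syntax):

```sql
CREATE TABLE sales_amounts (amount NUMERIC);
INSERT INTO sales_amounts (amount) VALUES (5), (20), (1), (7), (9);

WITH OrderedValues AS (
  SELECT amount,
         ROW_NUMBER() OVER (ORDER BY amount) AS rn,
         COUNT(*) OVER () AS total_count
  FROM sales_amounts
)
SELECT AVG(amount) AS median    -- returns 7 for this data
FROM OrderedValues
WHERE rn IN ((total_count + 1) / 2, (total_count + 2) / 2);
```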
- What is the `RANK()` function in SQL, and how is it different from other ranking functions?
Answer: The `RANK()` function is a window function that assigns a rank to each row within a partition of a result set, based on the ordering of a specified column or expression. If two or more rows are tied (i.e., have equal values in the ordering), they receive the same rank, and a gap is left in the ranking sequence for the next value. This is the key characteristic of `RANK()`: it produces gaps in the ranking sequence when there are ties.
Other related ranking functions include `DENSE_RANK()` and `ROW_NUMBER()`:
- `ROW_NUMBER()`: Assigns a unique sequential number to each row in the result set (within each partition, if partitioned). If there are ties in the ordering, `ROW_NUMBER` still gives different sequential numbers (essentially breaking the tie by order of appearance). It never gives the same number to equal values, and there are no gaps because it is simply sequential, irrespective of ties.
- `RANK()`: Gives equal rank to equal values, but if two rows are tied for rank 1, they both get rank 1 and the next rank will be 3 (since two rank-1s were given, rank 2 is skipped). Example: if ordering by score descending and the top two scores are 95 (a tie for highest), they each get rank 1; the next score, say 90, gets rank 3 (because two items had rank 1, the next rank is 1+2 = 3).
- `DENSE_RANK()`: Similar to RANK in that equal values get the same rank, but it does not leave gaps in the sequence. In the above example, with dense rank, the two 95s would be rank 1 and the next score, 90, would get rank 2 (dense rank compresses the ranking sequence).
Usage: These functions are used with an `OVER` clause that can have `PARTITION BY` and `ORDER BY`. For example:
`SELECT product_id, amount, RANK() OVER (ORDER BY amount DESC) AS sales_rank FROM sales;`
This would rank sales by amount (1 = highest amount). If two sales have the same amount, they get the same rank. If three sales tied for first, they all get rank 1 and the next rank shown would be 4 (since ranks 2 and 3 are skipped due to the three-way tie).
Practical difference: If you’re enumerating positions (like competition standings or top-N lists), `RANK()` is useful because if two people tie for second place, you call them both “2nd place” and then the next is “4th place” (no 3rd place awarded). If you wanted the next to be called “3rd” anyway, you’d use `DENSE_RANK()`. If you just want a unique number for each row regardless of ties (like assigning arbitrary IDs to rows in an ordered set), you use `ROW_NUMBER()`.
Example scenario: You have a table of students with their exam scores.
- Using `RANK()` on score will give class rankings with possible gaps. If two students tied for the top score, both get rank 1, and the student(s) with the next lower score get rank 3 (skipping rank 2).
- Using `DENSE_RANK()` instead would give the next lower score rank 2 (no gap).
- Using `ROW_NUMBER()` would arbitrarily give one of the top students rank 1 and the other rank 2, which isn’t fair in terms of ranking but might be needed if you must break ties for some listing.
Syntax to highlight ties handling:
`SELECT score, RANK() OVER (ORDER BY score DESC) AS rank, DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank, ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num FROM exams;`
If the scores are [100, 100, 95, 90], RANK() yields [1, 1, 3, 4], DENSE_RANK() yields [1, 1, 2, 3], and ROW_NUMBER() yields [1, 2, 3, 4].
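The `OVER` clause also supports `PARTITION BY`, which restarts the ranking per group. A small sketch, assuming the `exams` table also has a hypothetical `subject` column:

```sql
-- Rank scores within each subject independently; ties still share a rank
SELECT subject,
       score,
       RANK() OVER (PARTITION BY subject ORDER BY score DESC) AS subject_rank
FROM exams;
```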
In summary, `RANK()` provides ranking with gaps when there are ties, distinguishing it from `DENSE_RANK()` (no gaps) and `ROW_NUMBER()` (ties not considered, always a unique sequence). Use the one that fits the ranking logic you need.
- How can you join more than two tables in a single SQL query?
Answer: You can join multiple tables by chaining JOIN operations in the SQL query. SQL does not limit you to just two tables – you can join three, four, or many more by writing a series of JOIN clauses, each with its corresponding ON condition (except in the case of a chain of NATURAL joins or using comma joins, but those are not recommended). The general approach is to join two tables, then join the result to a third, and so on.
Example (3 tables):
Suppose we have three tables: `Orders`, `Customers`, and `Products`, and we want to retrieve orders with the customer name and product name. We can write:
`SELECT o.order_id, c.customer_name, p.product_name, o.quantity FROM Orders o JOIN Customers c ON o.customer_id = c.customer_id JOIN Products p ON o.product_id = p.product_id;`
Here:
- First, `Orders o` is joined to `Customers c` with the condition `o.customer_id = c.customer_id`. This pairwise join links each order to the corresponding customer.
- Then the resulting set (conceptually) is joined to `Products p` on `o.product_id = p.product_id`. This adds product details to each order.
The execution might actually be optimized in different orders, but logically it’s as if you joined Orders to Customers, then that result to Products.
You could add a fourth table by another `JOIN ... ON ...` clause at the end, and so on. Each additional join just needs a proper ON clause linking it to one of the already-joined tables (usually via a foreign key relationship or matching key).
Best practices:
- Use table aliases (as in the example) to keep the query readable, especially when joining many tables, and to clarify which table each column comes from.
- Ensure that join conditions are correct; missing a join condition (or using a CROSS JOIN intentionally/unintentionally) can result in a Cartesian product that multiplies rows incorrectly.
- When joining many tables, the database’s optimizer will decide the join order for efficiency (unless you use hints). It’s generally wise to have indexes on the join keys for performance.
Example (4 tables with different join types):
You can mix inner joins, left joins, etc. For instance:
`SELECT e.name, d.department_name, p.project_name, m.manager_name FROM Employees e LEFT JOIN Departments d ON e.department_id = d.id LEFT JOIN Projects p ON e.project_id = p.id LEFT JOIN Managers m ON d.manager_id = m.id;`
This would join 4 tables. Here we used LEFT JOINs to include employees even if some of the related info is missing (an employee without a department still shows up, with a NULL department). The principle is the same – just multiple join clauses.
Alternate syntax (not recommended for new SQL): Historically, one could list tables in the FROM clause separated by commas and put all join conditions in the WHERE clause, e.g., `FROM Orders o, Customers c, Products p WHERE o.customer_id = c.customer_id AND o.product_id = p.product_id`. This is equivalent to inner joins, but explicit `JOIN` syntax is preferred for clarity and for being able to use different join types.
Scaling up: It’s common in data warehousing to join a fact table with many dimension tables in one query (star schema queries often join a fact to 5-10 dimension tables). SQL handles that as long as the query is well-formed.
In summary, to join multiple tables, you simply keep adding `JOIN <TableName> ON <condition>` clauses to your query for each additional table, taking care to properly connect each one with the existing tables in the query.