Blog - RUCHITA SALUJA

Learn How To Write SQL Queries With Examples: #6

Data are becoming the new raw material of business.
Craig Mundie (President, Mundie & Associates | Former Senior Advisor to the CEO, Microsoft)

Question Source: LeetCode

Solution Language: MySQL

This Q&A series will cover data questions from LeetCode and present my solutions to them. Please feel free to comment with your suggestions if you feel that these problems may be solved in a more optimized manner.

Question (LeetCode Question #1270, Level: Medium)

Table: Employees

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| employee_id   | int     |
| employee_name | varchar |
| manager_id    | int     |
+---------------+---------+
employee_id is the primary key for this table.
Each row of this table indicates that the employee with ID employee_id and name employee_name reports his work to his/her direct manager with manager_id
The head of the company is the employee with employee_id = 1.

Write an SQL query to find employee_id of all employees that directly or indirectly report their work to the head of the company.

The indirect relation between managers will not exceed 3 managers as the company is small.

Return the result table in any order without duplicates.

The query result format is in the following example:

Employees table:
+-------------+---------------+------------+
| employee_id | employee_name | manager_id |
+-------------+---------------+------------+
| 1           | Boss          | 1          |
| 3           | Alice         | 3          |
| 2           | Bob           | 1          |
| 4           | Daniel        | 2          |
| 7           | Luis          | 4          |
| 8           | Jhon          | 3          |
| 9           | Angela        | 8          |
| 77          | Robert        | 1          |
+-------------+---------------+------------+

Result table:
+-------------+
| employee_id |
+-------------+
| 2           |
| 77          |
| 4           |
| 7           |
+-------------+

The head of the company is the employee with employee_id 1.
The employees with employee_id 2 and 77 report their work directly to the head of the company.
The employee with employee_id 4 reports his work indirectly to the head of the company 4 --> 2 --> 1.
The employee with employee_id 7 reports his work indirectly to the head of the company 7 --> 4 --> 2 --> 1.
The employees with employee_id 3, 8 and 9 don't report their work to head of company directly or indirectly.

Solution

This solution uses the Set operation “UNION ALL” instead of “UNION” because we do not anticipate any duplicates in the final result set. Here is my reasoning:

The only way that there could be duplicates in the final result set is if an employee reports to herself directly (employee_id = manager_id) as well as to employee_id 1 indirectly.
This can happen only if there is a data quality issue.
To deal with this situation, I have added a condition to the WHERE or ON clause of every CTE subquery to verify that, for any record, manager_id is not equal to employee_id.

This approach is good for performance as well because we are filtering out any instances that could have led to duplicates, and then using “UNION ALL”, which gives us a higher-performing query as compared to one using the “UNION” set operation.
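
To make the difference between the two set operations concrete, here is a minimal sketch that uses literal values rather than the problem data. UNION removes duplicate rows from the combined result, while UNION ALL keeps every row and so skips the de-duplication step.

```sql
-- UNION de-duplicates the combined rows: returns a single row (1)
SELECT 1 AS employee_id
UNION
SELECT 1;

-- UNION ALL keeps every row: returns two rows (1 and 1)
SELECT 1 AS employee_id
UNION ALL
SELECT 1;
```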

-- Direct Reports of Head of the company

WITH dr AS (select employee_id
 FROM Employees
 WHERE manager_id = 1 AND employee_id <>1),
 
 -- Indirect Reports of Head of the company (Levels 1 to 3)
 
 ir1 AS (SELECT e.employee_id 
         FROM Employees e JOIN dr
         ON e.manager_id = dr.employee_id
        AND e.manager_id <> e.employee_id),
         
 ir2 AS (SELECT e.employee_id 
         FROM Employees e JOIN ir1
         ON e.manager_id = ir1.employee_id
        AND e.manager_id <> e.employee_id),
                 
 ir3 AS (SELECT e.employee_id
         FROM Employees e JOIN ir2
         ON e.manager_id = ir2.employee_id
        AND e.manager_id <> e.employee_id)
         
SELECT employee_id FROM dr
UNION ALL
SELECT employee_id FROM ir1
UNION ALL
SELECT employee_id FROM ir2
UNION ALL
SELECT employee_id FROM ir3
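
As an alternative sketch (not the approach used above), the same hierarchy can be walked with a recursive CTE, which MySQL supports from version 8.0. The CTE name reports and the depth column are illustrative names of my own. The cap of 4 levels is an assumption chosen to mirror the four CTEs above (dr, ir1, ir2, ir3); it also stops the recursion if the data ever contains a reporting cycle.

```sql
WITH RECURSIVE reports AS (
    -- Anchor: direct reports of the head (skip the head's own self-reference)
    SELECT employee_id, 1 AS depth
    FROM Employees
    WHERE manager_id = 1 AND employee_id <> 1

    UNION ALL

    -- Recursive step: employees whose manager is already in the result
    SELECT e.employee_id, r.depth + 1
    FROM Employees e
    JOIN reports r ON e.manager_id = r.employee_id
    WHERE r.depth < 4   -- assumed cap: same number of levels as dr + ir1..ir3
)
SELECT employee_id
FROM reports;
```

On the sample table this returns 2 and 77 (depth 1), 4 (depth 2) and 7 (depth 3), in some order.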

Learn How To Write SQL Queries With Examples: #5

Information is the oil of the 21st century, and analytics is the combustion engine.
Peter Sondergaard (Former EVP, Research & Advisory – Gartner)

Question Source: LeetCode

Solution Language: MySQL

Question (LeetCode Question #1412, Level: Hard)

Table: Student

+---------------------+---------+
| Column Name         | Type    |
+---------------------+---------+
| student_id          | int     |
| student_name        | varchar |
+---------------------+---------+
student_id is the primary key for this table.
student_name is the name of the student.

Table: Exam

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| exam_id       | int     |
| student_id    | int     |
| score         | int     |
+---------------+---------+
(exam_id, student_id) is the primary key for this table.
Student with student_id got score points in exam with id exam_id.

A “quiet” student is one who took at least one exam and scored neither the highest nor the lowest score.

Write an SQL query to report the students (student_id, student_name) being “quiet” in ALL exams.

Don’t return the student who has never taken any exam. Return the result table ordered by student_id.

The query result format is in the following example.

Student table:
+-------------+---------------+
| student_id  | student_name  |
+-------------+---------------+
| 1           | Daniel        |
| 2           | Jade          |
| 3           | Stella        |
| 4           | Jonathan      |
| 5           | Will          |
+-------------+---------------+

Exam table:
+------------+--------------+-----------+
| exam_id    | student_id   | score     |
+------------+--------------+-----------+
| 10         |     1        |    70     |
| 10         |     2        |    80     |
| 10         |     3        |    90     |
| 20         |     1        |    80     |
| 30         |     1        |    70     |
| 30         |     3        |    80     |
| 30         |     4        |    90     |
| 40         |     1        |    60     |
| 40         |     2        |    70     |
| 40         |     4        |    80     |
+------------+--------------+-----------+

Result table:
+-------------+---------------+
| student_id  | student_name  |
+-------------+---------------+
| 2           | Jade          |
+-------------+---------------+

For exam 10: Students 1 and 3 hold the lowest and highest scores respectively.
For exam 20: Student 1 holds both the highest and lowest score.
For exams 30 and 40: Students 1 and 4 hold the lowest and highest scores respectively.
Students 2 and 5 have never got the highest or lowest score in any of the exams.
Since student 5 is not taking any exam, he is excluded from the result.
So, we only return the information of Student 2.

Solution

WITH cte AS 
(SELECT student_id, 
             score, 
             exam_id,
             (CASE WHEN score < MAX(score) OVER (PARTITION BY exam_id)
             AND score > MIN(score) OVER (PARTITION BY exam_id)
             THEN 'middle'
             ELSE 'highlow'
             END) AS category
FROM Exam
ORDER BY student_id),

cte1 AS (SELECT student_id
         FROM cte
         GROUP BY student_id
         HAVING SUM(CASE WHEN category = 'highlow'
                    THEN 1 ELSE 0
                    END) = 0
)

SELECT cte1.student_id, s.student_name
FROM cte1 JOIN Student s
ON cte1.student_id = s.student_id
ORDER BY cte1.student_id

Alternate Approaches…

WITH cte AS 
     (SELECT student_id, score, exam_id,
            max(score) OVER (PARTITION BY exam_id) AS maxscore,
            min(score) OVER (PARTITION BY exam_id) AS minscore
      FROM Exam),
      cte1 AS 
            (SELECT student_id
                  FROM cte
                  WHERE score = maxscore OR score = minscore
             )
SELECT DISTINCT Exam.student_id, Student.student_name
      FROM Exam JOIN Student
      ON Exam.student_id = Student.student_id
WHERE Exam.student_id NOT IN (SELECT student_id FROM cte1)
      ORDER BY Exam.student_id

WITH cte AS (SELECT student_id,
             rank() OVER (PARTITION BY exam_id ORDER BY score DESC) 
             AS gethighest,
             rank() OVER (PARTITION BY exam_id ORDER BY score ASC) 
             AS getlowest
             FROM Exam),
     cte1 AS (SELECT DISTINCT student_id,
              SUM(CASE WHEN gethighest = 1 
                            OR getlowest = 1 
                       THEN 1 
                       ELSE 0 END) 
                  OVER (PARTITION BY student_id ORDER BY student_id)
                  AS numofhighlow
              FROM cte)
SELECT cte1.student_id, student_name
      FROM cte1 JOIN Student
      ON cte1.student_id = Student.student_id
      WHERE cte1.numofhighlow = 0
      ORDER BY cte1.student_id

Learn How To Write SQL Queries With Examples: #4

If somebody tortures the data enough (open or not), it will confess anything.
Paolo Magrassi, (Former vice president, research director, Gartner)

Question Source: LeetCode

Solution Language: MySQL

Question (LeetCode Question #262, Level: Hard)

Table: Trips

+-------------+----------+
| Column Name | Type     |
+-------------+----------+
| Id          | int      |
| Client_Id   | int      |
| Driver_Id   | int      |
| City_Id     | int      |
| Status      | enum     |
| Request_at  | date     |     
+-------------+----------+
Id is the primary key for this table.
The table holds all taxi trips. Each trip has a unique Id, while Client_Id and Driver_Id are foreign keys to the Users_Id at the Users table.
Status is an ENUM type of (‘completed’, ‘cancelled_by_driver’, ‘cancelled_by_client’).

Table: Users

+-------------+----------+
| Column Name | Type     |
+-------------+----------+
| Users_Id    | int      |
| Banned      | enum     |
| Role        | enum     |
+-------------+----------+
Users_Id is the primary key for this table.
The table holds all users. Each user has a unique Users_Id, and Role is an ENUM type of (‘client’, ‘driver’, ‘partner’).
Banned is an ENUM type of (‘Yes’, ‘No’).

Write a SQL query to find the cancellation rate of requests with unbanned users (both client and driver must not be banned) each day between "2013-10-01" and "2013-10-03".

The cancellation rate is computed by dividing the number of canceled (by client or driver) requests with unbanned users by the total number of requests with unbanned users on that day.

Return the result table in any order. Round Cancellation Rate to two decimal points.

The query result format is in the following example:

Trips table:
+----+-----------+-----------+---------+---------------------+------------+
| Id | Client_Id | Driver_Id | City_Id | Status              | Request_at |
+----+-----------+-----------+---------+---------------------+------------+
| 1  | 1         | 10        | 1       | completed           | 2013-10-01 |
| 2  | 2         | 11        | 1       | cancelled_by_driver | 2013-10-01 |
| 3  | 3         | 12        | 6       | completed           | 2013-10-01 |
| 4  | 4         | 13        | 6       | cancelled_by_client | 2013-10-01 |
| 5  | 1         | 10        | 1       | completed           | 2013-10-02 |
| 6  | 2         | 11        | 6       | completed           | 2013-10-02 |
| 7  | 3         | 12        | 6       | completed           | 2013-10-02 |
| 8  | 2         | 12        | 12      | completed           | 2013-10-03 |
| 9  | 3         | 10        | 12      | completed           | 2013-10-03 |
| 10 | 4         | 13        | 12      | cancelled_by_driver | 2013-10-03 |
+----+-----------+-----------+---------+---------------------+------------+

Users table:
+----------+--------+--------+
| Users_Id | Banned | Role   |
+----------+--------+--------+
| 1        | No     | client |
| 2        | Yes    | client |
| 3        | No     | client |
| 4        | No     | client |
| 10       | No     | driver |
| 11       | No     | driver |
| 12       | No     | driver |
| 13       | No     | driver |
+----------+--------+--------+

Result table:
+------------+-------------------+
| Day        | Cancellation Rate |
+------------+-------------------+
| 2013-10-01 | 0.33              |
| 2013-10-02 | 0.00              |
| 2013-10-03 | 0.50              |
+------------+-------------------+

On 2013-10-01:
  - There were 4 requests in total, 2 of which were canceled.
  - However, the request with Id=2 was made by a banned client (User_Id=2), so it is ignored in the calculation.
  - Hence there are 3 unbanned requests in total, 1 of which was canceled.
  - The Cancellation Rate is (1 / 3) = 0.33
On 2013-10-02:
  - There were 3 requests in total, 0 of which were canceled.
  - The request with Id=6 was made by a banned client, so it is ignored.
  - Hence there are 2 unbanned requests in total, 0 of which were canceled.
  - The Cancellation Rate is (0 / 2) = 0.00
On 2013-10-03:
  - There were 3 requests in total, 1 of which was canceled.
  - The request with Id=8 was made by a banned client, so it is ignored.
  - Hence there are 2 unbanned requests in total, 1 of which was canceled.
  - The Cancellation Rate is (1 / 2) = 0.50

Solution

Approach With Joins:

WITH temp AS
(SELECT DISTINCT t.Request_at AS Day,
COUNT(CASE WHEN t.Status <> 'completed' THEN t.Id ELSE NULL END) OVER
(PARTITION BY t.Request_at) AS canceled,
COUNT(t.Id) OVER (PARTITION BY t.Request_at) AS total
FROM Trips t
JOIN Users uc ON t.Client_Id = uc.Users_Id
JOIN Users ud ON t.Driver_Id = ud.Users_Id
WHERE uc.Banned = 'No' AND ud.Banned = 'No'
AND t.Request_at BETWEEN CAST('2013-10-01' AS DATE)
AND CAST('2013-10-03' AS DATE))
SELECT Day,
CAST(canceled/total AS DECIMAL(65,2)) AS 'Cancellation Rate'
FROM temp

Alternate Approach (Without Joins)

SELECT Request_at AS Day,
CAST(COUNT(IF(Status != 'completed', true, null)) / COUNT(Id) AS DECIMAL(65,2))
AS 'Cancellation Rate'
FROM Trips
WHERE Request_at BETWEEN '2013-10-01' AND '2013-10-03'
AND Client_Id IN (SELECT Users_Id FROM Users WHERE Banned = 'No')
AND Driver_Id IN (SELECT Users_Id FROM Users WHERE Banned = 'No')
GROUP BY Request_at;

Learn How To Write SQL Queries With Examples: #3

Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway.
Geoffrey Moore, management consultant and author of Crossing the Chasm

Question Source: LeetCode

Solution Language: MySQL

Question (LeetCode Question #185, Level: Hard)

Table: Employee

+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| Id           | int     |
| Name         | varchar |
| Salary       | int     |
| DepartmentId | int     |
+--------------+---------+
Id is the primary key for this table.
Each row contains the ID, name, salary, and department of one employee.

Table: Department

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| Id          | int     |
| Name        | varchar |
+-------------+---------+
Id is the primary key for this table.
Each row contains the ID and the name of one department.

A company’s executives are interested in seeing who earns the most money in each of the company’s departments. A high earner in a department is an employee who has a salary in the top three unique salaries for that department.

Write an SQL query to find the employees who are high earners in each of the departments.

Return the result table in any order.

The query result format is in the following example:

Employee table:
+----+-------+--------+--------------+
| Id | Name  | Salary | DepartmentId |
+----+-------+--------+--------------+
| 1  | Joe   | 85000  | 1            |
| 2  | Henry | 80000  | 2            |
| 3  | Sam   | 60000  | 2            |
| 4  | Max   | 90000  | 1            |
| 5  | Janet | 69000  | 1            |
| 6  | Randy | 85000  | 1            |
| 7  | Will  | 70000  | 1            |
+----+-------+--------+--------------+

Department table:
+----+-------+
| Id | Name  |
+----+-------+
| 1  | IT    |
| 2  | Sales |
+----+-------+

Result table:
+------------+----------+--------+
| Department | Employee | Salary |
+------------+----------+--------+
| IT         | Max      | 90000  |
| IT         | Joe      | 85000  |
| IT         | Randy    | 85000  |
| IT         | Will     | 70000  |
| Sales      | Henry    | 80000  |
| Sales      | Sam      | 60000  |
+------------+----------+--------+

In the IT department:
- Max earns the highest unique salary
- Both Randy and Joe earn the second-highest unique salary
- Will earns the third-highest unique salary

In the Sales department:
- Henry earns the highest salary
- Sam earns the second-highest salary
- There is no third-highest salary as there are only two employees

Solution

WITH temp AS
(SELECT Name AS Employee,
Salary,
DepartmentId,
DENSE_RANK() OVER (PARTITION BY DepartmentId ORDER BY Salary DESC) AS rnk
FROM Employee)
SELECT d.Name AS Department,
temp.Employee,
temp.Salary
FROM temp JOIN Department d
ON temp.DepartmentId = d.Id
WHERE temp.rnk<=3

Difference between Rank() and Dense_Rank() Window Functions

In the above solution, I have used the Dense_Rank() function instead of the Rank() function. This is because the question describes a high earner in a department as “an employee who has a salary in the top three unique salaries for that department”.

To better illustrate the difference, check out the rank allocated to each employee in both cases (Note: I have displayed rank for all employees instead of just the top 3):

Result table:
+----+----------+--------+--------------+--------+--------------+
| Id | Employee | Salary | DepartmentId | Rank() | Dense_Rank() |
+----+----------+--------+--------------+--------+--------------+
| 4  | Max      | 90000  |       1      |   1    |       1      |
| 1  | Joe      | 85000  |       1      |   2    |       2      |
| 6  | Randy    | 85000  |       1      |   2    |       2      |
| 7  | Will     | 70000  |       1      |   4    |       3      |
| 5  | Janet    | 69000  |       1      |   5    |       4      |
| 2  | Henry    | 80000  |       2      |   1    |       1      |
| 3  | Sam      | 60000  |       2      |   2    |       2      |
+----+----------+--------+--------------+--------+--------------+

As you can see, with Rank(), a rank is skipped after tied ranks:
Joe and Randy both rank 2 in Department 1, but Will ranks 4 instead of 3 (rank 3 is skipped).

But we need Will to rank 3 instead of 4, so that we can filter by the condition rnk<=3 to get employees whose salaries are in the top 3 distinct salaries in each department.

To achieve this, we use the Dense_Rank() function. As you can see above, Will ranks 3 instead of 4 in the Dense_Rank() column.
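
A comparison like the one in the table above can be produced with a query along these lines. This is a sketch against the Employee table from the question; the aliases rnk and dense_rnk are my own, chosen because RANK is a reserved word in MySQL 8.0.

```sql
SELECT Id,
       Name AS Employee,
       Salary,
       DepartmentId,
       -- RANK() leaves a gap after ties (1, 2, 2, 4, ...)
       RANK()       OVER (PARTITION BY DepartmentId ORDER BY Salary DESC) AS rnk,
       -- DENSE_RANK() does not leave gaps (1, 2, 2, 3, ...)
       DENSE_RANK() OVER (PARTITION BY DepartmentId ORDER BY Salary DESC) AS dense_rnk
FROM Employee
ORDER BY DepartmentId, Salary DESC;
```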

Learn How To Write SQL Queries With Examples: #2

Data Is A Precious Thing And Will Last Longer Than The Systems Themselves.
Sir Tim Berners-Lee (The inventor of the World Wide Web)

Question Source: LeetCode

Solution Language: MySQL

Question (LeetCode Question #177, Level: Medium)

Write a SQL query to get the n^th highest salary from the Employee table.

+----+--------+
| Id | Salary |
+----+--------+
| 1  | 100    |
| 2  | 200    |
| 3  | 300    |
+----+--------+

For example, given the above Employee table, the n^th highest salary where n = 2 is 200. If there is no n^th highest salary, then the query should return null.

+------------------------+
| getNthHighestSalary(2) |
+------------------------+
| 200                    |
+------------------------+

Solution

CREATE FUNCTION getNthHighestSalary(N INT) RETURNS INT
BEGIN
SET N = N-1;
RETURN (
Select
CASE when count(distinct Salary) <= N then null
else
(select distinct Salary
from Employee
order by Salary desc limit 1 offset N)
end
from Employee
);
END

NOTE:

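In MySQL, LIMIT and OFFSET cannot take an expression such as N-1; inside a stored function they accept only integer constants, routine parameters, or local variables. That is why N is decremented with SET before the query.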
LIMIT 1 OFFSET N can also be written as LIMIT N,1
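
The note above can be illustrated with a variant sketch of the same function. It assumes MySQL's documented behaviour that LIMIT and OFFSET inside a stored program accept integer-valued local variables; the variable name skip_rows is my own.

```sql
CREATE FUNCTION getNthHighestSalary(N INT) RETURNS INT
BEGIN
  -- OFFSET cannot take the expression N - 1 directly,
  -- so compute it into a local variable first.
  DECLARE skip_rows INT;
  SET skip_rows = N - 1;
  RETURN (
    -- A scalar subquery that finds no row evaluates to NULL,
    -- which covers the "no n-th highest salary" case.
    SELECT DISTINCT Salary
    FROM Employee
    ORDER BY Salary DESC
    LIMIT 1 OFFSET skip_rows
  );
END
```

When this is run outside LeetCode, for example in the mysql command-line client, the statement delimiter has to be changed first so that the semicolons inside the body are not read as the end of the CREATE FUNCTION statement.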

Question (LeetCode Question #184, Level: Medium)

The Employee table holds all employees. Every employee has an Id, a salary, and there is also a column for the department Id.

+----+-------+--------+--------------+
| Id | Name  | Salary | DepartmentId |
+----+-------+--------+--------------+
| 1  | Joe   | 70000  | 1            |
| 2  | Jim   | 90000  | 1            |
| 3  | Henry | 80000  | 2            |
| 4  | Sam   | 60000  | 2            |
| 5  | Max   | 90000  | 1            |
+----+-------+--------+--------------+

The Department table holds all departments of the company.

+----+----------+
| Id | Name     |
+----+----------+
| 1  | IT       |
| 2  | Sales    |
+----+----------+

Write a SQL query to find employees who have the highest salary in each of the departments. For the above tables, your SQL query should return the following rows (order of rows does not matter).

+------------+----------+--------+
| Department | Employee | Salary |
+------------+----------+--------+
| IT         | Max      | 90000  |
| IT         | Jim      | 90000  |
| Sales      | Henry    | 80000  |
+------------+----------+--------+

Explanation:

Max and Jim both have the highest salary in the IT department and Henry has the highest salary in the Sales department.

Solution

Approach 1:

WITH temp AS
(SELECT d.Id, d.Name, MAX(e.Salary) AS Salary
FROM Employee e JOIN Department d
ON e.DepartmentId = d.Id
GROUP BY 1,2
)
SELECT temp.Name AS Department, Employee.Name AS Employee, Employee.Salary
FROM Employee JOIN temp
ON Employee.DepartmentId = temp.Id
WHERE Employee.Salary = temp.Salary;

Approach 2:

WITH temp AS (
SELECT DepartmentId, Name, Salary,
RANK() OVER (PARTITION BY DepartmentId ORDER BY Salary DESC) AS rnk
FROM Employee)
SELECT d.Name AS Department, temp.NAME AS Employee, temp.SALARY AS Salary
FROM temp
JOIN Department d
ON d.Id = temp.DepartmentId AND rnk = 1

Approach 3:

SELECT Department, Employee, Salary
FROM (SELECT
d.Name AS Department,
e.Name AS Employee,
Salary,
RANK() OVER (PARTITION BY e.DepartmentId ORDER BY e.Salary DESC) AS rnk
FROM Employee e JOIN Department d
ON e.DepartmentId = d.Id) AS temp
WHERE rnk = 1

NOTE:

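Approaches 2 and 3 use “Window Functions” to solve this problem.
A window function computes a value for each row over a set of related rows (here, all employees in the same department) without collapsing those rows the way GROUP BY does. This often reduces query complexity by removing the separate aggregation step and the join back to the original table that Approach 1 needs, and it can also help performance on large datasets.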
To learn more about the Window functions used in MySQL 8.0, click here.
Another source for learning about Window functions is Mode Analytics.

Learn How To Write SQL Queries With Examples: #1

The goal is to turn data into information, and information into insight.
Carly Fiorina (Former CEO of Hewlett-Packard)

Question Source: LeetCode

Solution Language: MySQL

Question (LeetCode Question #176, Level: Easy)

Write a SQL query to get the second highest salary from the Employee table.

+----+--------+
| Id | Salary |
+----+--------+
| 1  | 100    |
| 2  | 200    |
| 3  | 300    |
+----+--------+

For example, given the above Employee table, the query should return 200 as the second highest salary. If there is no second-highest salary, then the query should return null.

+---------------------+
| SecondHighestSalary |
+---------------------+
| 200                 |
+---------------------+

Solution

Select Max(Salary) as SecondHighestSalary
From Employee
Where Salary not in (Select Max(Salary) From Employee)

Question (LeetCode Question #181, Level: Easy)

Write a SQL query that finds out employees who earn more than their managers.

The Employee table holds all employees including their managers. Every employee has an Id, and there is also a column for the manager Id.

+----+-------+--------+-----------+
| Id | Name  | Salary | ManagerId |
+----+-------+--------+-----------+
| 1  | Joe   | 70000  | 3         |
| 2  | Henry | 80000  | 4         |
| 3  | Sam   | 60000  | NULL      |
| 4  | Max   | 90000  | NULL      |
+----+-------+--------+-----------+

Given the Employee table, write a SQL query that finds out employees who earn more than their managers. For the above table, Joe is the only employee who earns more than his manager.

+----------+
| Employee |
+----------+
| Joe      |
+----------+

Solution

SELECT e.Name AS Employee
FROM Employee e JOIN Employee m
ON e.ManagerId = m.Id AND e.Salary > m.Salary

Alternative Approach (Faster Query):

select temp.Name as Employee
from (select e.* , m.Salary as MgrSalary
from Employee e join Employee m
on e.ManagerId = m.Id) AS temp
WHERE temp.Salary > temp.MgrSalary

Successful Product Ownership: What Does It Look Like?

Product Owner is no less than a super hero

Be stubborn on vision but flexible on details
Jeff Bezos

I was looking through courses in LinkedIn Learning today when I came across a course on Foundations of an Agile Product Owner role. I absolutely loved going through this course for the following reasons:

It clearly walks through what a day in a Product Owner’s shoes looks like.
- Meetings
- Negotiations
- Reviews
- Analyses
- Facilitations
- Relationship Building
It walks through the set of skills required to be successful in this role.
- Relationship Building
- Analysis
- Decision Making
- Leadership and Communication
- Value Analysis
- Facilitation
This course covers, in a very easy-to-understand manner, the work that goes into product ownership, from product vision, roadmaps, backlogs, and refinement to agile planning (including sprint and release planning).
This course also provides the following credits:
- International Institute of Business Analysis™ (IIBA®) –> Continuing Development Units (CDUs) : 1.5
- National Association of State Boards of Accountancy (NASBA) –> Continuing Professional Education Credit (CPE): 2.2
- Project Management Institute (PMI)® –> PDUs/Contact Hours: 1

If you are trying to understand what a Product Owner does on a day-to-day basis, or just trying to become better at this role, this course is a good reference. Please find the link to the course below:

Agile Product Owner Role: Foundations

That said, I have summarized my learnings from the course in the embedded PowerPoint deck below. Also, if you’d like to learn more about value-based problem solving, do check out my post on the topic by clicking here:

Value-Based Problem-Solving

SQL Joins: A Guide That Makes It Stupid – Simple

Post author By RUCHITA SALUJA
Post date June 16, 2021
Categories In Learning SQL

Businesses across the globe have evolved over the past decade. With companies striving to Digitally Transform their value chains and become more Lean and Agile, Analysts and Program/Project/Product Managers of today need to become data-driven problem solvers. To solve problems using data, one needs to access, analyze and report on the business data. However, with the amount of data collected in current times, Excel spreadsheets are neither feasible nor sustainable for this purpose.

This brings me to a solution that lets us access and manipulate large data sets. Yes, I am talking about the most commonly used data querying language – SQL (Structured Query Language).

What is SQL?

Per Mode Analytics, SQL is a programming language that is semantically easy to understand and learn and is used to access large amounts of data directly from the data source.

“SQL is great for performing the types of aggregations that you might normally do in an Excel pivot table—sums, counts, minimums and maximums, etc.—but over much larger datasets and on multiple tables at the same time.”
MODE ANALYTICS

SQL becomes truly powerful when we join multiple data sources using certain Join conditions to answer our questions. So what are the different ways we can join data tables using SQL?

Types of Joins in SQL

Outer Joins

1. Left Join

We have seen numerous resources that explain a Left (Outer) Join with a Venn Diagram that looks somewhat like this:

However, I believe that the above Venn diagram doesn’t exactly give clarity on this type of join. One might not be able to completely grasp the final outcome of such a join.

In a Left Join, as shown in the Venn Diagram above, all the records from Table A are present in the result set. Any record with no matching data in Table B has “Null” values in place of Table B values.

Let’s consider the below example wherein there are 2 tables: EMPLOYEE and ADDRESS. Let’s say that the employee John Doe doesn’t have an address listed in the Address table. In addition to that, the employee Jill Mcfee has 2 separate addresses in the ADDRESS table. So when we “Left Join” the EMPLOYEE table with the ADDRESS table, the result table will look as demonstrated below.

SELECT e.First_Name, e.Last_Name, a.City, a.State
FROM EMPLOYEE e
LEFT JOIN ADDRESS a
ON a.Employee_ID = e.Employee_ID;

NOTE: The result table above has a total of 4 rows instead of 3 because the employee record of Jill Mcfee was joined to 2 addresses on the ADDRESS table, resulting in 2 records.
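
For readers who want to reproduce the result, here is a minimal sketch with sample data. Only the names John Doe and Jill Mcfee and the unmatched address (Address ID 104 for Employee ID 4) come from this article; the Address_ID column name, the third employee, the other address IDs, and all cities and states are assumptions made up for illustration.

```sql
-- Hypothetical sample tables; column types are assumptions
CREATE TABLE EMPLOYEE (
    Employee_ID INT PRIMARY KEY,
    First_Name  VARCHAR(50),
    Last_Name   VARCHAR(50)
);
CREATE TABLE ADDRESS (
    Address_ID  INT PRIMARY KEY,
    Employee_ID INT,
    City        VARCHAR(50),
    State       VARCHAR(2)
);

INSERT INTO EMPLOYEE VALUES
    (1, 'John', 'Doe'),         -- has no address
    (2, 'Jill', 'Mcfee'),       -- has two addresses
    (3, 'Sam',  'Lee');         -- hypothetical third employee

INSERT INTO ADDRESS VALUES
    (101, 2, 'Austin',  'TX'),  -- Jill's first address (made-up values)
    (102, 2, 'Dallas',  'TX'),  -- Jill's second address (made-up values)
    (103, 3, 'Seattle', 'WA'),
    (104, 4, 'Boston',  'MA');  -- no employee with Employee_ID 4

SELECT e.First_Name, e.Last_Name, a.City, a.State
FROM EMPLOYEE e
LEFT JOIN ADDRESS a
ON a.Employee_ID = e.Employee_ID
ORDER BY e.Employee_ID, a.Address_ID;

-- Rows returned (4, not 3):
-- John | Doe   | NULL    | NULL
-- Jill | Mcfee | Austin  | TX
-- Jill | Mcfee | Dallas  | TX
-- Sam  | Lee   | Seattle | WA
```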

2. Right Join

This type of join is depicted as below in Venn Diagram format:

All the records from Table B are present in the result set. Any record with no matching data in Table A has “Null” values in place of Table A values.

To give better clarity on the final result of the join function, I’ll use the example from Left Join. The below illustration shows how a Right Join works.

SELECT e.First_Name, e.Last_Name, a.City, a.State
FROM EMPLOYEE e
RIGHT JOIN ADDRESS a
ON a.Employee_ID = e.Employee_ID;

3. Full Outer Join

A Full Outer Join includes all records from both Left and Right Tables, with the missing values represented as Null values.

We can better illustrate the Full Outer Join by using the example from the Left Join section instead of the Venn Diagram above. The illustration below shows a Full Outer Join between EMPLOYEE and ADDRESS tables.

SELECT e.First_Name, e.Last_Name, a.City, a.State
FROM EMPLOYEE e
FULL OUTER JOIN ADDRESS a
ON a.Employee_ID = e.Employee_ID;

SELECT e.First_Name, e.Last_Name, a.City, a.State
FROM EMPLOYEE e
FULL JOIN ADDRESS a
ON a.Employee_ID = e.Employee_ID;

In this example, employee John Doe doesn’t have any address in the ADDRESS table, and therefore has Null values in the City and State columns in the result table. Similarly, there is no employee in the EMPLOYEE table for Address ID = 104 and Employee ID = 4. Therefore, the values for the First Name and Last Name columns will be Null in the result table for this record.
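
One caveat for MySQL, the language used for the LeetCode solutions on this blog: MySQL does not implement FULL OUTER JOIN. A common workaround, sketched here against the same EMPLOYEE and ADDRESS tables, is to combine a Left Join with the unmatched rows of a Right Join.

```sql
-- Every employee, with or without an address
SELECT e.First_Name, e.Last_Name, a.City, a.State
FROM EMPLOYEE e
LEFT JOIN ADDRESS a
ON a.Employee_ID = e.Employee_ID

UNION ALL

-- Plus the addresses that have no matching employee
SELECT e.First_Name, e.Last_Name, a.City, a.State
FROM EMPLOYEE e
RIGHT JOIN ADDRESS a
ON a.Employee_ID = e.Employee_ID
WHERE e.Employee_ID IS NULL;
```

The WHERE filter on the second query keeps only the address rows that the Left Join could not produce, so UNION ALL can be used without returning matched rows twice.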

Inner Join

Inner Join represents the Intersection of 2 Tables. In other words, the result set contains only those records that have matching values in both tables.

In our example, the inner join between tables EMPLOYEE and ADDRESS is illustrated below.

We can build the query for an inner join in 2 ways:

SELECT e.First_Name, e.Last_Name, a.City, a.State
FROM EMPLOYEE e
INNER JOIN ADDRESS a
ON a.Employee_ID = e.Employee_ID;

SELECT e.First_Name, e.Last_Name, a.City, a.State
FROM EMPLOYEE e
JOIN ADDRESS a
ON a.Employee_ID = e.Employee_ID;

A Concluding Note…

This article talks about inner and outer joins, and then types of outer joins. While in this article, I discussed only those scenarios where we join 2 tables, we often need to join more than 2 tables. That’s when we truly start seeing the complexity increase.

In such cases, the order of the joins becomes important.

For instance, Table A LEFT JOIN Table B LEFT JOIN Table C is not the same as Table C LEFT JOIN Table A LEFT JOIN Table B.
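
A small sketch shows why the order matters. The tables A, B and C, their columns, and their rows are hypothetical and exist only for this illustration: the first table in a chain of Left Joins decides which rows are guaranteed to appear.

```sql
-- Hypothetical tables, for illustration only
CREATE TABLE A (id INT PRIMARY KEY);
CREATE TABLE B (id INT PRIMARY KEY, a_id INT);
CREATE TABLE C (id INT PRIMARY KEY, a_id INT);

INSERT INTO A VALUES (1), (2);
INSERT INTO B VALUES (10, 1);    -- matches A.id = 1
INSERT INTO C VALUES (100, 3);   -- matches no row in A

-- Starts from A, so every row of A is kept: returns 2 rows
-- (1, 10, NULL) and (2, NULL, NULL)
SELECT A.id AS a_id, B.id AS b_id, C.id AS c_id
FROM A
LEFT JOIN B ON B.a_id = A.id
LEFT JOIN C ON C.a_id = A.id;

-- Starts from C, so every row of C is kept: returns 1 row
-- (NULL, NULL, 100)
SELECT A.id AS a_id, B.id AS b_id, C.id AS c_id
FROM C
LEFT JOIN A ON A.id = C.a_id
LEFT JOIN B ON B.a_id = A.id;
```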

More to come on the importance of order of joins. Stay tuned…

A Simple Value-Based Problem-Solving Approach

“The interesting thing about business, it’s not like the Olympics. You don’t get any extra points for the fact that something’s very hard to do. So you might as well just step over one-foot bars, instead of trying to jump over seven-foot bars.”
Warren Buffett

How does someone end up in such a situation where their solution “Just Doesn’t Cut It” with the customer?

This situation can arise for a few reasons:

Misunderstanding the problem the customer is trying to solve
Shiny Object Syndrome – Being biased towards the solution – tools/processes/features
The value that the solution delivers is not enough to warrant the time and effort invested from the customer’s perspective
- A customer may be internal or external to one’s team/company.

It’s easy to get caught up in continuously investing time and resources into solutions that ultimately don’t meet user expectations. While it is not wrong to pursue perfection in one’s work or build a solution that one truly believes in, finding out about user discontent at the time of solution delivery is an expensive mistake. Such mistakes come at the cost of time, effort, and customer trust. Moreover, if the business needs are urgent, there may be no room for such misses at all.

Is there a way to ensure user acceptance early on, even before any work is initiated?

In my experience, the best method is to always start by understanding the problem that one is trying to solve. You may find your stakeholders requesting a certain work output that they are sure will solve their problems. But more often than not, the requested work product resolves only the symptoms of the problem at hand, and not the problem itself. In such situations, it becomes vital to ask “Why?”. Understanding the “why” prevents us from investing time and effort in fulfilling the request, only to have the work scrapped and restarted from scratch.

We understand the problem now. What next?

The next step is the solution design phase of problem-solving. The goal of Solution Design is to provide stakeholders with visibility into how their business requirements will be met. In other words, a solution design is the blueprint of the solution. Once we have this blueprint, it must be reviewed by involving the strategic stakeholders to ensure that valuable inputs are received upfront. This helps us progress in the correct direction, and hence, increases the odds of successful user adoption.

We have the stakeholder buy-in now. How should the work be structured?

Once the initial design decisions are made and reviewed by the relevant stakeholders, work can begin on bringing the proof of concept to life.

Long gone are the days when the popular practice was to gather project requirements, get stakeholder approval, and execute the solution, hoping that nothing changed between requirement review and solution delivery, and then communicate with stakeholders only in one of two scenarios:

The solution was ready to share
Timelines needed to be shifted

The above approach is known in the software world as the SDLC Waterfall model. Whether in software development or in non-tech project management, this approach has proved ineffective in gaining customer satisfaction. In contrast, an incremental approach to building the solution helps with stakeholder buy-in and ensures that relevant feedback is received early on, before much time and effort are invested in the wrong direction.

Yes, I am referring to being Agile – and no, the Agile approach is not just limited to the Software industry. But, what does it mean to be Agile?

An agile method relies upon incremental and iterative completion of goals with a self-managing team.

There is no fixed set of rules and tools for incremental problem-solving. The solution that is being incrementally built can take any form, depending on the complexity of the problem and the stage of problem-solving. This could range from a simple flowchart to a fully built-out Minimum Viable Product (MVP).

To conclude, I’d just like to assert that:

Building a product or a process must be user-focused and therefore should follow an incremental problem-solving approach. This ensures a robust feedback loop is set up with customers/strategic stakeholders, increasing the adoption/success odds.

Whether you are trying to build products, processes, or services, I’d highly recommend reading this book:

Product Roadmaps Relaunched: How to Set Direction while Embracing Uncertainty


My First Visualization Project on Tableau Public

I’d been thinking about trying out Tableau Public for quite some time. This weekend, I finally got to it. The first step was to select the dataset I wanted to use and determine the questions to be answered by analyzing this data. So, I decided to pick a topic that’s of interest to anyone on a work visa like myself. In the visualization shared below, I have analyzed USCIS data for H1B visa applications from 2009 through 2020. Through this analysis, I wanted to see the real impact on visa approvals after 2016 and understand who was really impacted.

Work Visa Approvals post-2016 Elections

Check out Tableau Public to see the latest “Viz of the Day”