Showing SQL On Jupyter Notebook Using DuckDB

September 30, 2023

Problem:

Need to analyze a large dataset using SQL and show the results in a Jupyter notebook.

Solution:

Recently DuckDB caught my attention. It is designed as a fast online analytical processing (OLAP) database that runs in-process, so it does not need a separate database server installation.
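
DuckDB ships as a regular Python package, so installing it from inside the notebook is a one-liner (assuming pip is available in the environment):

%pip install duckdb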

import duckdb

That’s it. We now have a fully functional relational database engine embedded in the notebook process. Connecting to (or creating) a database file is as easy as:

conn = duckdb.connect('e-commerce.duckdb')
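
Passing a file path makes the database persistent on disk. If all we need is a throwaway workspace, connecting without an argument gives an in-memory database instead (a small sketch):

# in-memory database: nothing is written to disk, and the data disappears when the process ends
tmp_conn = duckdb.connect()   # equivalent to duckdb.connect(':memory:')
tmp_conn.sql('SELECT 42 AS answer').show()
tmp_conn.close()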

Let’s create a tiny dataset to test the power of DuckDB.

conn.execute("""
-- Create the Products table
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(255),
    price DECIMAL(10, 2),
    category VARCHAR(50)
);

-- Insert sample data into the Products table
INSERT INTO products (product_id, product_name, price, category)
VALUES
    (1, 'Laptop', 999.99, 'Electronics'),
    (2, 'Smartphone', 499.99, 'Electronics'),
    (3, 'Headphones', 79.99, 'Electronics'),
    (4, 'T-shirt', 19.99, 'Clothing'),
    (5, 'Jeans', 39.99, 'Clothing');

-- Create the Sessions table with marketing traffic columns
CREATE TABLE sessions (
    session_id INT PRIMARY KEY,
    user_id INT,
    utm_source VARCHAR(255),
    utm_type VARCHAR(255),
    device_type VARCHAR(50),
    session_date DATE
);

-- Insert sample data into the Sessions table
INSERT INTO sessions (session_id, user_id, utm_source, utm_type, device_type, session_date)
VALUES
    (1, 101, 'google', 'brand', 'mobile', '2023-09-28'),
    (2, 102, 'facebook', 'social', 'desktop', '2023-09-28'),
    (3, 103, 'twitter', 'social', 'mobile', '2023-09-27'),
    (4, 104, 'google', 'brand', 'mobile', '2023-09-27'),
    (5, 105, 'instagram', 'social', 'desktop', '2023-09-26');

-- Create the Orders table
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    user_id INT,
    session_id INT,
    product_id INT,
    quantity INT,
    order_date DATE
);

-- Insert sample data into the Orders table
INSERT INTO orders (order_id, user_id, session_id, product_id, quantity, order_date)
VALUES
    (1, 101, 1, 2, 2, '2023-09-28'),
    (2, 102, 2, 1, 1, '2023-09-28'),
    (3, 103, 3, 3, 1, '2023-09-27'),
    (4, 104, 4, 2, 3, '2023-09-27'),
    (5, 105, 5, 4, 2, '2023-09-26');
             """)
<duckdb.duckdb.DuckDBPyConnection at 0x2bd1c1cf470>
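
To double-check that all three tables landed in the database file, DuckDB understands the usual SHOW TABLES statement:

conn.sql('SHOW TABLES')   # should list orders, products and sessions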

Now we can run SQL queries against the data using the sql() method, and the nicely formatted results are displayed directly in the Jupyter notebook. Great! πŸš€

conn.sql('SELECT * FROM orders')
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ order_id β”‚ user_id β”‚ session_id β”‚ product_id β”‚ quantity β”‚ order_date β”‚
β”‚  int32   β”‚  int32  β”‚   int32    β”‚   int32    β”‚  int32   β”‚    date    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚        1 β”‚     101 β”‚          1 β”‚          2 β”‚        2 β”‚ 2023-09-28 β”‚
β”‚        2 β”‚     102 β”‚          2 β”‚          1 β”‚        1 β”‚ 2023-09-28 β”‚
β”‚        3 β”‚     103 β”‚          3 β”‚          3 β”‚        1 β”‚ 2023-09-27 β”‚
β”‚        4 β”‚     104 β”‚          4 β”‚          2 β”‚        3 β”‚ 2023-09-27 β”‚
β”‚        5 β”‚     105 β”‚          5 β”‚          4 β”‚        2 β”‚ 2023-09-26 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
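
For ad-hoc filters it is safer to pass values as prepared-statement parameters than to build the SQL string by hand. A small sketch using execute() with the ? placeholder:

# fetch a single day's orders as a DataFrame, with the date passed as a bound parameter
conn.execute('SELECT * FROM orders WHERE order_date = ?', ['2023-09-28']).df()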

More complex queries work too. Suppose we want a per-day breakdown of the product with the highest ordered quantity, and we are curious which marketing traffic type produced that order.

conn.sql(
    """
    WITH daily_product_rank AS (
        SELECT
            o.product_id AS product_id,
            o.order_date AS order_date,
            s.utm_type AS utm_type,
            RANK() OVER (
                PARTITION BY o.order_date
                ORDER BY o.quantity DESC
            ) AS quantity_rank
        FROM orders o
        LEFT JOIN sessions s
            ON o.session_id = s.session_id
    )
    SELECT *
    FROM daily_product_rank
    WHERE quantity_rank = 1
    ORDER BY order_date
    """
)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ product_id β”‚ order_date β”‚ utm_type β”‚ quantity_rank β”‚
β”‚   int32    β”‚    date    β”‚ varchar  β”‚     int64     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚          4 β”‚ 2023-09-26 β”‚ social   β”‚             1 β”‚
β”‚          2 β”‚ 2023-09-27 β”‚ brand    β”‚             1 β”‚
β”‚          2 β”‚ 2023-09-28 β”‚ brand    β”‚             1 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
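
The same SQL scales well beyond this toy dataset: DuckDB can query Parquet or CSV files directly, without loading them into a table first. A small sketch (events.parquet is a hypothetical file name):

# scan a Parquet file directly; DuckDB reads only what the query needs
conn.sql("SELECT count(*) FROM read_parquet('events.parquet')")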

The power of DuckDB can be combined with Pandas too.

import pandas as pd

df = conn.sql("""
         SELECT
            p.product_name,
            COUNT(DISTINCT o.order_id) as order_count
         FROM orders o LEFT JOIN products p
            ON o.product_id = p.product_id
         GROUP BY p.product_name
         ORDER BY order_count DESC
         """).to_df()

df = df.set_index('product_name')
print(df.to_markdown())

| product_name   |   order_count |
|:---------------|--------------:|
| Smartphone     |             2 |
| T-shirt        |             1 |
| Laptop         |             1 |
| Headphones     |             1 |
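
The reverse direction works as well: a DataFrame can be registered as a view and queried with SQL again (a small sketch using conn.register from DuckDB's Python API):

# expose the DataFrame back to DuckDB as a view and query it with SQL
conn.register('product_orders', df.reset_index())
conn.sql('SELECT * FROM product_orders WHERE order_count > 1').show()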

Here is a simple bar chart of the products and their order counts. We can also create visualizations with popular libraries such as Seaborn or Plotly as needed.

df.plot(kind='bar')
<Axes: xlabel='product_name'>
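
For example, a Seaborn version of the same chart could look roughly like this (assuming seaborn and matplotlib are installed):

import seaborn as sns
import matplotlib.pyplot as plt

# same data, drawn with Seaborn instead of the built-in pandas plot
sns.barplot(data=df.reset_index(), x='product_name', y='order_count')
plt.xticks(rotation=45)
plt.show()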

Awesome! It is important to note that DuckDB is not optimized for high write loads, so it is not meant to replace traditional transactional databases, at least for now πŸ¦†.

Lastly, remember to close the connection.

conn.close()
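
If manual cleanup feels easy to forget, recent DuckDB Python releases also let the connection be used as a context manager, which closes it automatically (a small sketch; worth verifying with the installed version):

with duckdb.connect('e-commerce.duckdb') as conn:
    conn.sql('SELECT * FROM orders').show()
# the connection is closed automatically when the with block exits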

Cheers!