Abstract nonsense lurking beneath SQL

This is part of the “machine learning for disillusioned mathematicians” series. Let’s get straight to it.

Visualizing tables

A cartoon of a table with only one column is a line with a bunch of dots.

one column table

Each dot represents a “row” of the table. Similarly, a table with two columns is cartoonified as a bunch of line segments between two lines.

two column table

Here, each line segment corresponds to a row of the table. Naturally, a table with three columns is visualized as a bunch of triangles

three column table

And for four column tables, we use tetrahedra, but I can’t draw it because I’m lazy. \(N\)-column tables are visualized as a bunch of \(N\)-simplices, and I don’t need to draw it because it’s sooooooo easy to visualize an \(N\)-simplex if you are an inter-dimensional being.

nullible fields?

If a field can take on a null value, we get slightly different visuals. For example, a three column table where the third field is nullible is visualized as

nullible table

What one should note is that there are both triangles and line segments (as opposed to purely triangles as in the non-nullible case). So nullible fields allow for mixed shapes.

Visualizing Unions and Joins

This is a cartoon of what UNION looks like

union

This is a cartoon of an INNER JOIN

inner join

An Outer Join

outer join

And finally, a cross Join

cross_join join

Perhaps you think that was abstract nonsense?

… Hold my beer.

My spidey-sense tells me that INNER JOIN is a limit of the diagram

\[\begin{array}{cc} \bullet & \bullet \\ \downarrow & \downarrow \\ \bullet & \bullet \end{array}\]

Moreover, OUTER JOIN is a co-limit of the same diagram.

Finally, SQL’s CROSS JOIN is a limit of the diagram

\(\bullet \quad \bullet\).

I’ll take my beer back now.

I’m going to puke

I apologize. With only 4 lines of text and two diagrams, I probably ejected 90% of my potential readership. That was clearly an unhealthy amount of abstraction.

There really is no reason to write/read a blog post on SQL joins because they are such a simple concepts. The writer of such an article would need to be so stupid, so ignorant, that he/she would probably stoop as low as redefining the most basic concept, like “tables”.

The notion of a table

These cartoons suggest that a table is abstractly represented as some sort of collection of \(N\)-simplices (possibly of varying dimension). However, it’s nut just a set. It’s a multiset. We can formally represent a table as a multiset equipped with a collection of morphisms called “columns”. Lets unpack this.

Multi-what?

A set is just a bag of distinct things. For example \(\{1,2,3\}\) is a set. If you take the set \(\{1,2,3\}\) and insert a \(1\) again, you still get \(\{1,2,3\}\), as opposed to something that has two copies of \(1\). A set has no notion of multiplicity.

Enhancing sets with multiplicity yields the concept of a multiset as a pair \((S,m)\) where \(S\) is a set and \(m: S \to \mathbb{N}\) denotes the multiplicity of each element. Here \(\mathbb{N} = \{ 1, 2, 3, \dots \}\) denotes the natural numbers.

I like to think of multisets as the toy models of measures. Just like measure, the natural action of a map on a multiset is a push-forward. A morphism from \((X,m)\) to \((Y,n)\) is a (set) map \(f:X \to Y\) such that the diagram

multiset morphism

commutes. The collection of multisets along with these arrows form a category I will call \({\rm Multiset}\).

Tables in a relational database are multisets, as opposed to just plain old sets, because it’s possible to have duplicate rows in a table.

Example: a two column table

Now there are some deranged freaks who insist that a table is something that looks like

f_0	f_1
1	`true`
2	`true`
2	`true`
3	`false`

I call these sick and twisted individuals “normal people”. However, it’s perfectly valid to represent this so-called “table” as the multiset \((S, m)\) where \(S = \{(1,true),(2,true),(3,false)\}\) and

\[\begin{align*} m(x,y) = \begin{cases} 1 & \text{if } (x,y) = (1,true) \text{ or } (3,false) \\ 2 & \text{if } (x,y) = (2,true) \\ \end{cases} \end{align*}\]

The columns are the maps \(f_0(x,y) = x\) and \(f_1(x,y) = y\) for \((x,y) \in S\). So we see a table is a multiset.

Example: nullible fields

You can represent a table with a nullible field such as

f_0	f_1
1	`true`
2	`true`
2	`true`
3	`false`
2	`NULL`

by considering the set \(S = \{(1,true),(2,true),(3,false), 2 \}\), and the same multiplicity function as the previous section, but extended by defining \(m(2)=1\). In any case, NULL is naturally accounted for by multisets, and NULL entries are merely an artifact that occasionally arises when we display a multiset as a (normal person’s) table.

The `DISTINCT` monad

The forgetful functor \(F: (S,m) \in {\rm Multiset} \mapsto S \in {\rm Set}\) has an adjoint, \(G:{\rm Set} \to {\rm Multiset}\) given by \(G(S) = (S,1_S)\) where \(1_S: S \to \mathbb{N}\) is just the constant function with value \(1\). We can define the monad \({\rm DISTINCT} := G \circ F\). This is the content of the DISTINCT keyword in SQL.

Cross joins

Cross joins are just another word for a product. Once you realize a table is just a multiset, a cross join is truly the natural generalization of cartesian products within \({\rm Multiset}\). Given multisets (tables) \((X,m)\) and \((Y,n)\) the cartesian product (cross join), denoted \((X,m) \times (Y,n)\), is the multiset (table) \((X \times Y, m \cdot n)\) where \((m\cdot n)(x,y) := m(x)n(y)\).

Unions

Given sets \(X\) and \(Y\), you know what the union is. Given multisets \((X,m)\) and \((Y,n)\) the union is the multiset \((X \cup Y , m+n)\). This translates directly to the SQL notion of UNION

Natural (inner) Join

Here is an SQL query

SELECT *
FROM t1 JOIN t2
ON t1.col1 = t2.col2

This is one example of joining tables. Here is another (which assumes t1 has only one column)

SELECT *
FROM t1 JOIN t2
ON LOWER(t1)=t2.col2

Basically, an inner join requires that you have two tables \(t_1,t_2\), along with multiset morphisms \(f:t_1 \to b_1\) and \(g:t_2 \to b_2\). Then the inner join of \(t_1\) and \(t_2\) is the largest table which makes the diagram

inner join diagram

commute.

Full outer joins

The story for outer joins is similar, except you swap the direction of some of the arrows. An outer join is the minimal table which makes the diagram

outer join diagram

commute. In order to display the table (in the normal people sense) we will probably need to have NULL entries. However, the definition of outer-join does not even need a notion of NULL. It’s naturally incorporated.

Final thoughts

I wish I could take this further, but I am sleepy. After some googling I stumbled upon this stack exchange posting which seems like a decent rabbit hole to climb into. Unsurprisingly, it appears David Spivak has dug into this stuff a lot, and one of his papers mentioned in the stackexchange post is called Simplicial Databases. If that’s the title, the mental picture of databases is probably similar to the caveman paintings at the beginning of this post.

There are a lot of things that feel expressible in categorical terms that I puzzled over quite a bit but never figured out

Left Joins
Group by
Aggregation
Schemas

In particular, I feel as if GROUP BY eats multiset morphisms and outputs multisets of multisets, which then collapse back down to a “first order” multiset after an aggregation. It feels like there is a monad lurking, but I never managed to make this anything more than a feeling. If any of you readers have anything to add, can you reach out to me?

Shoutout

I used tikz-cd editor a lot in the writing of this post. Super awesome tool.