Abstract nonsense lurking beneath SQL
This is part of the “machine learning for disillusioned mathematicians” series. Let’s get straight to it.
Visualizing tables
A cartoon of a table with only one column is a line with a bunch of dots.
Each dot represents a “row” of the table. Similarly, a table with two columns is cartoonified as a bunch of line segments between two lines.
Here, each line segment corresponds to a row of the table. Naturally, a table with three columns is visualized as a bunch of triangles
And for four column tables, we use tetrahedra, but I can’t draw it because I’m lazy. \(N\)-column tables are visualized as a bunch of \(N\)-simplices, and I don’t need to draw it because it’s sooooooo easy to visualize an \(N\)-simplex if you are an inter-dimensional being.
nullible fields?
If a field can take on a null value, we get slightly different visuals. For example, a three column table where the third field is nullible is visualized as
What one should note is that there are both triangles and line segments (as opposed to purely triangles as in the non-nullible case). So nullible fields allow for mixed shapes.
Visualizing Unions and Joins
This is a cartoon of what UNION
looks like
This is a cartoon of an INNER JOIN
An Outer Join
And finally, a cross Join
Perhaps you think that was abstract nonsense?
… Hold my beer.
My spidey-sense tells me that INNER JOIN
is a limit of the diagram
Moreover, OUTER JOIN
is a co-limit of the same diagram.
Finally, SQL’s CROSS JOIN
is a limit of the diagram
\(\bullet \quad \bullet\).
I’ll take my beer back now.
I’m going to puke
I apologize. With only 4 lines of text and two diagrams, I probably ejected 90% of my potential readership. That was clearly an unhealthy amount of abstraction.
There really is no reason to write/read a blog post on SQL joins because they are such a simple concepts. The writer of such an article would need to be so stupid, so ignorant, that he/she would probably stoop as low as redefining the most basic concept, like “tables”.
The notion of a table
These cartoons suggest that a table is abstractly represented as some sort of collection of \(N\)-simplices (possibly of varying dimension). However, it’s nut just a set. It’s a multiset. We can formally represent a table as a multiset equipped with a collection of morphisms called “columns”. Lets unpack this.
Multi-what?
A set is just a bag of distinct things. For example \(\{1,2,3\}\) is a set. If you take the set \(\{1,2,3\}\) and insert a \(1\) again, you still get \(\{1,2,3\}\), as opposed to something that has two copies of \(1\). A set has no notion of multiplicity.
Enhancing sets with multiplicity yields the concept of a multiset as a pair \((S,m)\) where \(S\) is a set and \(m: S \to \mathbb{N}\) denotes the multiplicity of each element. Here \(\mathbb{N} = \{ 1, 2, 3, \dots \}\) denotes the natural numbers.
I like to think of multisets as the toy models of measures. Just like measure, the natural action of a map on a multiset is a push-forward. A morphism from \((X,m)\) to \((Y,n)\) is a (set) map \(f:X \to Y\) such that the diagram
commutes. The collection of multisets along with these arrows form a category I will call \({\rm Multiset}\).
Tables in a relational database are multisets, as opposed to just plain old sets, because it’s possible to have duplicate rows in a table.
Example: a two column table
Now there are some deranged freaks who insist that a table is something that looks like
f_0 | f_1 |
---|---|
1 | true |
2 | true |
2 | true |
3 | false |
I call these sick and twisted individuals “normal people”. However, it’s perfectly valid to represent this so-called “table” as the multiset \((S, m)\) where \(S = \{(1,true),(2,true),(3,false)\}\) and
\[\begin{align*} m(x,y) = \begin{cases} 1 & \text{if } (x,y) = (1,true) \text{ or } (3,false) \\ 2 & \text{if } (x,y) = (2,true) \\ \end{cases} \end{align*}\]The columns are the maps \(f_0(x,y) = x\) and \(f_1(x,y) = y\) for \((x,y) \in S\). So we see a table is a multiset.
Example: nullible fields
You can represent a table with a nullible field such as
f_0 | f_1 |
---|---|
1 | true |
2 | true |
2 | true |
3 | false |
2 | NULL |
by considering the set \(S = \{(1,true),(2,true),(3,false), 2 \}\), and the same multiplicity function as the previous section, but extended by defining \(m(2)=1\). In any case, NULL
is naturally accounted for by multisets, and NULL
entries are merely an artifact that occasionally arises when we display a multiset as a (normal person’s) table.
The DISTINCT
monad
The forgetful functor \(F: (S,m) \in {\rm Multiset} \mapsto S \in {\rm Set}\) has an adjoint, \(G:{\rm Set} \to {\rm Multiset}\) given by \(G(S) = (S,1_S)\) where \(1_S: S \to \mathbb{N}\) is just the constant function with value \(1\). We can define the monad \({\rm DISTINCT} := G \circ F\). This is the content of the DISTINCT
keyword in SQL.
Cross joins
Cross joins are just another word for a product. Once you realize a table is just a multiset, a cross join is truly the natural generalization of cartesian products within \({\rm Multiset}\). Given multisets (tables) \((X,m)\) and \((Y,n)\) the cartesian product (cross join), denoted \((X,m) \times (Y,n)\), is the multiset (table) \((X \times Y, m \cdot n)\) where \((m\cdot n)(x,y) := m(x)n(y)\).
Unions
Given sets \(X\) and \(Y\), you know what the union is. Given multisets \((X,m)\) and \((Y,n)\) the union is the multiset \((X \cup Y , m+n)\). This translates directly to the SQL notion of UNION
Natural (inner) Join
Here is an SQL query
SELECT *
FROM t1 JOIN t2
ON t1.col1 = t2.col2
This is one example of joining tables. Here is another (which assumes t1 has only one column)
SELECT *
FROM t1 JOIN t2
ON LOWER(t1)=t2.col2
Basically, an inner join requires that you have two tables \(t_1,t_2\), along with multiset morphisms \(f:t_1 \to b_1\) and \(g:t_2 \to b_2\). Then the inner join of \(t_1\) and \(t_2\) is the largest table which makes the diagram
commute.
Full outer joins
The story for outer joins is similar, except you swap the direction of some of the arrows. An outer join is the minimal table which makes the diagram
commute. In order to display the table (in the normal people sense) we will probably need to have NULL
entries. However, the definition of outer-join does not even need a notion of NULL
. It’s naturally incorporated.
Final thoughts
I wish I could take this further, but I am sleepy. After some googling I stumbled upon this stack exchange posting which seems like a decent rabbit hole to climb into. Unsurprisingly, it appears David Spivak has dug into this stuff a lot, and one of his papers mentioned in the stackexchange post is called Simplicial Databases. If that’s the title, the mental picture of databases is probably similar to the caveman paintings at the beginning of this post.
There are a lot of things that feel expressible in categorical terms that I puzzled over quite a bit but never figured out
- Left Joins
- Group by
- Aggregation
- Schemas
In particular, I feel as if GROUP BY
eats multiset morphisms and outputs multisets of multisets, which then collapse back down to a “first order” multiset after an aggregation. It feels like there is a monad lurking, but I never managed to make this anything more than a feeling. If any of you readers have anything to add, can you reach out to me?
Shoutout
I used tikz-cd editor a lot in the writing of this post. Super awesome tool.