eno writer

023 - sets and bitsets, from python to zig

It's been over a year since I switched from programming in high level languages (python and javascript) to programming in a low level language (zig). Supposedly, high level languages are much easier to work with than low level languages, but I've been surprised by how infrequently I've felt myself longing for the expressiveness of python. In fact, last week was the first time I actually found myself having to switch to python to figure something out.

I was struggling with writing an algorithm to analyze some lists of unique elements. For example, given the following lists:

A, B, C, F
A, D, F
A, B, C, E
A, D, E

I wanted to write some code that would figure out the underlying logic:

A, (B, C or D), (F or E)

You may have noticed that every element in these lists is unique - there are no repeats. In math, a collection of unique elements is called a Set and it has a lot of useful properties. I never learned about this in school, but after many years of using a data structure called set in python, I discovered one day that it wasn't some ingenious invention of the authors of Python but a serious math thing.

I loved using sets in Python because they can allow for very succinct expressions of logic. For instance, say you have two users who each selected some items and you want to combine their selections into one. Using arrays you get this:

user1 = [A, B, C]
user2 = [B, C, D]
combined = user1
for item in user2:
    if item not in combined:
        combined.append(item)

Whereas with sets you can do this:

user1 = {A,B,C}
user2 = {B,C,D}
combined = user1 + user2

Getting back to my algorithm, I knew right off the bat that sets would make it easier to reason about this problem so I opened up a new python file, described my problem carefully to the AI and asked it to write me an algorithm. It failed miserably. Accepting that I would never get those 5 minutes of my life back, I did a little bit of hacking myself and came up with a solution. The first part of my algorithm worked by creating matrices of which changes appear or do not appear with which changes. The matrices were represented as lists of sets:

every = set([A, B, C, D, E, F] )
always = [every.copy() for _ in range(len(every))]
never = [every.copy() for _ in range(len(every))]

So always[1] would be a set where every element is an element that B always appears with (e.g. C). To populate these matrices, I would iterate through the samples and update them:

for item, idx in enumerate(every):
    for a in samples:
        if item in a:
            always[idx] = always[idx].intersection(a)
            never[idx] -= a

In the last couple lines you can see the lovely simplicity of set operations. Taking the intersection of two sets returns a new set that contains only elements present in both sets - progressively narrowing always until it has the correct elements. Subtracting two sets removes the elements of the second set from those in the first.

Now it was time to move this over to zig. Someone has written an open source set library for zig but I kind of have a thing against dependencies and it looked more complicated than what I needed for this assignment. Instead, I decided to implement it from scratch by breaking down the set operations into lower level operations. How hard could it be? I just need to implement intersection() and set subtraction - in both cases I just need to look through two lists of items and make a third based on what I find.

In some ways it is that simple. The problem is that in lower level languages, working with dynamic lists is a bit more complicated than in high level languages. Things are much easier if you aren't resizing arrays and adding and removing elements from them all the time. Wrestling with this, I came up with the idea of having a parallel list which would track whether a given element was present in the set I was dealing with.

For example, representing the set A, B, C, E would look like this:

every = [A, B, C, D, E, F]
is_present = [True, True, True, False, True, False]

Then I could implement intersection() like this in zig:

for(set1_present, 0..) |is_present, index| {
   if(set2_present[index] and is_present) {
      result_present[index] = true;
   } else {
      result_present[index] = false;
   }
}

Not too bad! After working with this for a bit everything started to fall into place. This whole solution seemed very obvious to me and I wondered if I had stumbled into some greater truth. While AI often fails me when I write code, it's very good at filling in gaps in my math and computer science education. Accordingly, I opened up my trusty AI and asked:

Is there a way to represent a set as an array of booleans where 
True indicates that the element is present in the set and False 
indicates it is not.  

And it responded:

Yes, representing a set as an array of booleans is a common strategy, 
particularly when dealing with sets of integers or elements that can be 
mapped to indices in a straightforward manner. This technique is often 
referred to as a bitset in computing.

Turns out my great idea is a mere "common strategy". This is a typical experience in my life as a self taught developer. Accepting that I hadn't just invented a new field of computer science, I opened up the zig standard library and ctrl+f'd for "bitset" and low and behold, there was an entire bit_set.zig file with many different types of BitSets to choose from. They even had methods like setIntersection()!

I proceeded to rewrite everything with these lovely bitsets and it all worked great. Ultimately, my zig solution essentially had line by line parity with the Python solution. Once again, the distinction between high and low level code was not as significant as advertised.


If you enjoyed this post, subscribe below to get notifications for the next one.

Powered by Buttondown.

We also have an RSS feed