Grabango Architecture

Grabango was a startup founded in 2016 by Pandora co-founder Will Glaser. It was one of a crop of checkout-free computer vision startups that aimed to use inference so that grocery store shoppers could just walk in, put snacks in a basket, and then walk out. It failed when it ran out of money: the company wasn’t able to secure enough contracts, and the contracts it did have weren’t generating enough revenue.

On October 17th, 2024, roughly two years after I was hired, Will held an all-hands and let everyone know that Grabango didn’t have enough on-hand funds to continue operations and that we would no longer be employed by the end of the day. It was heartbreaking, but they did let me keep my company laptop. That was nice.

Now a little over a year later, I want to write down what I remember of the architecture that I oversaw before time shifts my recollection from hazy memories to apocryphal hearsay.

Lay of the Land

At Grabango, I started out on a very “large” team of five engineers - a mix of devops and network engineers. As funding dried up, our team was eventually compressed to three engineers. While there, my responsibilities were to:

  • Build release software for the IoT fleet at each of Grabango’s stores
  • Provide platform tooling for developers
  • Define cloud architecture and standards for cloud infrastructure
  • Conduct deployments
  • Create SLI/SLOs for services
  • Manage Kubernetes and improve cluster tooling
  • Respond to outages and conduct post-mortems

I also just… did random stuff like automate financial reporting. Idle hands don’t really exist in a startup.

So. Let’s start with a direct description of what Grabango hoped to achieve from a product perspective:

Grabango’s product worked like this: a user could walk into a store with the Grabango app on their phone, payment information already stored. The user would badge into a kiosk set up at the front of the store to start a shopping session, shop, and then walk out. Within about 10 minutes, their card would be charged. Checkout free. It worked.

To implement this, the entire store had been mapped out in advance using Matterport, and there was also a 2D plane representation of the store’s floor.

Mounted above the shelves on the store’s ceiling were “rails” of Jetson Nanos (small computers with small GPUs), configured into “strings” (multiple rails networked together). Each Jetson Nano had two MIPI CSI-2 cameras attached via ribbon cables, positioned above the grocery store shelves so that they could track shoppers.

The rails didn’t just consist of multiple Nanos; they also contained a small device to power cycle the Nanos as well as switches for networking. Fun fact: because the switches were daisy-chained along the strings, we once had an outage caused by a network broadcast storm. The switches were cheap, semi-managed things chosen to keep capex down, so STP wasn’t really a thing.

All of the rails reported back to a server half-rack configured at the back of the store. The rack was fairly self-contained: a UPS (uninterruptible power supply), two PDUs (power distribution units), a switch from Juniper, a router, a 2U server from Supermicro, and various other odds and ends. The Nanos existed on one VLAN, everything else on a separate management VLAN. We’d get a dedicated circuit from an ISP and the router would be the network entrypoint. Access was controlled via OpenVPN and we’d typically SSH directly into the server to do administration. Both the racked server and the Nanos ran Ubuntu.

┌──────────────────── STORE CEILING ────────────────────┐
│                                                       │
│   [Nano]─[Nano]─[Nano]─[Nano]    ← "string" of rails  │
│    2×cam  2×cam  2×cam  2×cam                         │
│                                                       │
│   [Nano]─[Nano]─[Nano]─[Nano]                         │
│                                                       │
│   [Kiosk NUC]           shoppers walk in / walk out   │
└────────────────────────┬──────────────────────────────┘
                         │
            Nano VLAN ───┤
                         │
            Mgmt VLAN ───┤
                         ▼
┌──────────────── BACK-OF-HOUSE HALF RACK ──────────────┐
│  UPS                                                  │
│  PDU × 2                                              │
│  Juniper switch                                       │
│  Router ── dedicated ISP circuit ── Internet          │
│  Supermicro 2U server  (Ubuntu)                       │
│     ├─ Nomad  (single-server)                         │
│     ├─ Consul                                         │
│     ├─ Kafka / Prometheus / Loki / Grafana            │
│     ├─ ZFS pool  (video retention)                    │
│     ├─ Local Docker registry + Nginx apt cache        │
│     └─ Git mirror  (for ansible-pull)                 │
└───────────────────────────────────────────────────────┘

On the Server

The server itself ran:

  • Nomad for compute orchestration
  • Consul (service discovery and health checking for Nomad workloads)
  • node_exporter (for host level metrics)
  • ZFS (for the drive pool where video data was stored)

Nomad Workloads | Why Nomad

I was the reason Nomad was adopted for onsite workload orchestration. I pushed for it because we needed to control the distribution of both containerized and non-containerized workloads. For example, the Grabango kiosk that customers badged into was an Electron application that was originally deployed by an engineer SSHing into the kiosk NUC, SCPing an apt package over, installing it, and then doing a restart. Nomad - through its raw_exec driver - was able to manage the lifecycle of that kiosk application for free, without the primary engineer needing to dedicate time to dockerizing the application. Even for workloads that could be containerized, an engineer who worked on that software would have had to divert effort to do so. Adopting an orchestration system that was plug-and-play for existing solutions was beneficial given tight deadlines. We just… didn’t have a lot of staff. Also, Nomad was just easy to install and use. I actually really enjoyed working with it and used it as a basis for my homelab.
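As a sketch of what that looked like, a raw_exec Nomad job for the kiosk might have been shaped roughly like this (the job name, paths, and constraint metadata here are invented, not Grabango’s actual definitions):

```hcl
# Hypothetical sketch of a raw_exec job for the kiosk Electron app.
job "kiosk-app" {
  datacenters = ["store-042"]
  type        = "service"

  group "kiosk" {
    # Pin the job to the kiosk NUC rather than the Nanos or the server.
    constraint {
      attribute = "${meta.device_role}"
      value     = "kiosk"
    }

    task "electron-kiosk" {
      driver = "raw_exec"   # no containerization required

      config {
        command = "/opt/grabango/kiosk/kiosk-app"
        args    = ["--kiosk-mode"]
      }

      # Nomad restarts the process if it crashes - the lifecycle
      # management that previously happened by hand over SSH.
      restart {
        attempts = 3
        interval = "5m"
        delay    = "15s"
        mode     = "delay"
      }
    }
  }
}
```

The point is the driver line: swap raw_exec for docker and the rest of the job definition barely changes, which is what made a mixed fleet of containerized and bare-process workloads cheap to manage.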

So. Nomad’s responsibilities were:

  • Server | Nano: A Rust-based application for streaming video data from a ZFS pool on the main server to GCS buckets
  • Server: Kafka (Kafka’s role was to provide a messaging plane for inference data)
  • Server: Prometheus (a lot of telemetry data was collected)
  • Server: Loki (logging)
  • Server: Grafana (dashboards)
  • Server | Nano: A Python-based application that would track shoppers as they walked around the store
  • Nano: A Python daemon that would take in raw camera data and produce inference data
  • And other odds and ends

Each Nano had a Nomad client installed and the half-rack server acted as the sole Nomad server. Note that Nomad was not clustered; it ran as a single-server deployment. Not the best, but capex requirements always introduced tough constraints. I wanted to make this more resilient down the road but was never able to float it as a priority.
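For reference, a single-server Nomad topology boils down to very little configuration (the address here is invented):

```hcl
# On the half-rack server: one Nomad server, no raft quorum, no failover.
server {
  enabled          = true
  bootstrap_expect = 1
}

# On each Nano: a client pointed at the store server.
client {
  enabled = true
  servers = ["10.0.20.1:4647"]  # hypothetical address of the half-rack server
}
```

Losing that one server meant losing orchestration for the whole store, which is exactly the resiliency gap I wanted to close.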

Why Not Kubernetes

Assuming Grabango had had the organizational resources to containerize all of its on-hardware workloads, Kubernetes was still a bit heavyweight at the time. K3s and microk8s were still relatively immature, and I found the contrast in operational overhead between Kubernetes and Nomad to be palpable. I asked myself “if I have to maintain an implementation across dozens of stores and thousands of devices, which one is a conceptually simpler system to reason about and administrate?” and I arrived at Nomad.

I don’t regret my choice at all. Nomad worked REALLY well for deployments. I did miss k8s tooling like ArgoCD, Argo Workflows, etc. (I was building that out for our cloud Kubernetes clusters) but Nomad was just a workhorse that was easy to set up and never failed.

Deployments

Workloads

Early on, something I strove for with the general engineering staff was health checks and telemetry data; I wanted to know the gradient of health for each device on the network within a store and for any workload running on those devices. A problem Grabango faced before I joined was extremely high failure rates during deployments, with each deployment operating as a black box. The initial solution, before my time, was SSHing into the main server and kicking off a Saltstack run that would converge each Nano. Note that Salt handled both the deployment of workloads AND the converging of OS state.

Faced with this state of the world, I wanted these changes:

  1. The inference daemon that ran on the Nanos would have a health check endpoint and an expansion of Prometheus metrics
  2. Nomad would handle releases for workloads
  3. Ansible would handle OS state
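To make point one concrete, here’s a minimal sketch of the shape of such a health endpoint, using only Python’s standard library. The route name and the health fields are invented; the real daemon aggregated camera and inference pipeline state in its own way:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative health state - the fields here are invented for the sketch.
HEALTH = {"cameras_ok": 2, "cameras_total": 2, "inference_alive": True}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_response(404)
            self.end_headers()
            return
        healthy = (HEALTH["inference_alive"]
                   and HEALTH["cameras_ok"] == HEALTH["cameras_total"])
        body = json.dumps({"healthy": healthy, **HEALTH}).encode()
        # 200 when healthy, 503 when degraded - easy for an orchestrator to poll.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the sketch quiet
        pass

def serve(port=0):
    """Start the endpoint on a background thread; return the bound port."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server.server_address[1]
```

An endpoint like this is what lets health checks (e.g. Nomad service checks registered in Consul) feed allocation health back into deployments instead of deploys being a black box.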

Nomad job definitions were written for every workload that ran in a store. The Nanos and their inference daemons were given a rolling deploy strategy that processed serially. I had future plans to make it so that the process wouldn’t be serial, but there was always a priority crunch and so it was an acceptable business decision to have a simple and stable - if long - deployment mechanism.
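In Nomad terms, a serial rolling deploy is just an update stanza; something roughly like this (values illustrative, not the real config):

```hcl
# Sketch of a serial rolling-update stanza for the inference daemon group.
update {
  max_parallel     = 1      # one Nano at a time - slow but predictable
  min_healthy_time = "30s"  # allocation must stay healthy this long
  healthy_deadline = "5m"   # ...and become healthy within this window
  auto_revert      = false  # halt on failure; rollback is a manual call
}
```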

Rollbacks and Failure Recovery

There were generally three problems encountered during a deploy:

  1. A Nano doesn’t work
  2. One or more cameras don’t work
  3. There is a bug in the inference daemon

We only really wanted to stop deployments on condition three. On any other failure mode, we’d want to continue the rollout. The problem is that it’s difficult to differentiate between a hardware failure and a software failure when you’re rolling out to hundreds, if not thousands, of devices; at that point you’re just fighting statistics. If 10% of the cameras across the fleet have failed and are in a degraded state, how do you differentiate between those hardware failures and a software bug during a high-intensity operation like a fleet-wide deployment?

At this point, I gotta mention that Grabango had an internally developed fleet management system. It was a Swiss Army knife for visualizing the state of the rails and providing some measure of control over them, issuing API requests to the power controllers embedded in each rail to reboot troublesome nodes. The details of the system escape me; I only interfaced with it, I didn’t build it. But it had an API for reading and writing device state and you could call it from within a Nomad deploy.

Telemetry from the Nanos would be fed into this state management system and it would produce a map of failed nodes (like an actual visual map of the store with red / green dots to indicate health). Nomad would then query this system on an allocation level as it performed rollouts in order to get an aggregate failure rate to help build a picture of whether or not the fleet was degrading due to hardware wonkiness or a general bug. Once that threshold was hit (rolling window on a percentage), allocations would fail and Nomad would stop the deployment but not revert the whole deploy. The reason we wouldn’t do an entire rollback is that we’d want to do a manual validation before triggering a return to a previous daemon version.
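The gating logic itself is simple to sketch. This is a hypothetical reconstruction, not Grabango’s code - the calls to the fleet-state API are elided and replaced with a record() method:

```python
from collections import deque

class FailureGate:
    """Rolling-window failure-rate check, in the spirit of the gate that
    queried the fleet-state system during rollouts. The window size and
    threshold here are invented values."""

    def __init__(self, window=50, threshold=0.10):
        self.results = deque(maxlen=window)  # True = allocation failed
        self.threshold = threshold

    def record(self, failed: bool) -> None:
        """In the real system this data came from the fleet-state API."""
        self.results.append(failed)

    def should_halt(self) -> bool:
        """Halt the rollout once the windowed failure rate crosses the
        threshold - suggesting a general bug, not scattered hardware flakes."""
        if not self.results:
            return False
        rate = sum(self.results) / len(self.results)
        return rate >= self.threshold

gate = FailureGate(window=10, threshold=0.3)
for failed in [False, False, True, False, True, True]:
    gate.record(failed)
print(gate.should_halt())  # 3 failures / 6 results = 0.5 >= 0.3 -> True
```

A handful of isolated hardware failures stays under the threshold and the rollout continues; a burst of correlated failures trips it and Nomad stops scheduling new allocations.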

System State

Separate from the workload deployments, we also needed to codify OS state for the server and all the IoT edge devices. When I joined, this was all controlled by Saltstack but, honestly, we all hated Salt. I had some experience using Ansible to build AMIs for EC2, so I wrote a proposal for migrating over to Ansible and then we did the deed. The main server ran off of playbook runs triggered via Jenkins. The edge devices used ansible-pull.
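On the edge devices, the ansible-pull runs were just scheduled locally. An illustrative crontab entry (the repo URL, paths, and interval are invented):

```shell
# Every 15 minutes, pull the config repo and converge local OS state.
# --only-if-changed skips the playbook run when the repo hasn't moved.
*/15 * * * * ansible-pull \
    --url http://store-server.local/mirrors/edge-config.git \
    --directory /opt/ansible \
    --only-if-changed \
    local.yml >> /var/log/ansible-pull.log 2>&1
```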

There were multiple reasons why we ditched Salt:

  1. The previous Salt codebase was a mess - it needed a heavy refactoring regardless
  2. Salt was just buggy and we would have problems like memory pressure or silent failures
  3. Ansible had ansible-pull which conceptually scaled better than a push-based model

Caching

When you have hundreds of IoT devices centrally located, you need an intermediary to handle the distribution of software and data to them; otherwise you’ll saturate the upstream services those devices depend on to reconcile their state. This meant that the main server often acted as a cache for Docker images, apt packages, git repository data, etc., configured using a variety of mechanisms.

To frame the problem, let’s say that a store has 600 Nanos. Each one needs to pull a container image of 500MB. That’s 300GB of traffic passing through the WAN! Given that Grabango aimed to operate at bandwidth-constrained sites, this could translate to saturated circuits. To solve that problem, a daemon ran on the server that replicated images from the cloud (GCR) into a local registry. A sync-based mechanism was chosen because the native Docker registry only supported pull-through caching for Docker Hub, whereas Grabango used a hosted registry in Google Cloud. Cache invalidation was never needed since we always strictly tagged our releases (we used semver). Nomad job definitions would reference image tags, so a sync failure would halt deploys before they started.
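A hedged sketch of the sync selection logic - the actual daemon was internal and I’m substituting skopeo as the copy mechanism; the registry hostnames and image paths are invented:

```python
import re
import subprocess

# Strict semver, with an optional leading "v".
SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")

def tags_to_mirror(remote_tags, local_tags):
    """Return strictly-semver remote tags not yet present locally.
    Because releases were immutably tagged, presence in the local
    registry means the image is already correct - no invalidation."""
    wanted = {t for t in remote_tags if SEMVER.match(t)}
    return sorted(wanted - set(local_tags))

def mirror(image, tag, local_registry="registry.store.local:5000"):
    """Hypothetical: copy one tag from GCR into the local registry using
    skopeo (one of several tools that can do registry-to-registry copies)."""
    src = f"docker://gcr.io/grabango/{image}:{tag}"  # invented project path
    dst = f"docker://{local_registry}/{image}:{tag}"
    subprocess.run(["skopeo", "copy", src, dst], check=True)
```

Mutable tags like latest never match the semver filter, which is what makes the “never invalidate” property safe.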

         GCP-hosted container registry
         (source of truth for images)
                    │
              sync daemon
              (pulls by semver tag,
               runs on the store server)
                    │
                    ▼  1 × WAN pull per release
 ┌────────────── STORE SERVER (cache tier) ─────────────┐
 │                                                      │
 │   Local Docker registry  ◄── Nanos pull images       │
 │   Nginx apt cache        ◄── Nanos pull .debs        │
 │   Git mirror             ◄── ansible-pull on Nanos   │
 │                                                      │
 └──────────────────────────────────────────────────────┘
                    ▲
                    │   LAN (no WAN hops)
                    │
         [Nano] [Nano] [Nano] ... × hundreds per store

 Without cache: 600 Nanos × 500 MB = 300 GB over WAN per deploy
 With cache:    1 × 500 MB over WAN,  300 GB over LAN

We also had apt packages that needed to be replicated for some legacy stuff. I set up Nginx for this as I was fairly familiar with it. Not really much to say.

Ansible-pull references git repositories in order to reconcile state. It inverts Ansible’s push-based deploy mechanism and makes it a pull-based one - which is far more scalable. But with that pull-based model, we couldn’t have thousands of Nanos hitting GitHub to clone repositories… so I ended up setting up a git mirror on each server for the Nanos to talk to.
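Setting up such a mirror is straightforward with git’s own tooling; an illustrative version (URLs and paths invented):

```shell
# One-time: create a bare mirror of the upstream repo on the store server.
git clone --mirror git@github.com:grabango/edge-config.git /srv/git/edge-config.git

# Periodically refresh it (e.g. from cron) so ansible-pull on the Nanos
# sees new commits without any Nano ever touching GitHub.
git --git-dir=/srv/git/edge-config.git remote update --prune
```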

Testing and Release Management

There were multiple stages that software went through before it was released to production. We were generally pretty careful at Grabango - the number one cause of outages was ISP outages, not faulty deploys.

We did have a decent HIL (hardware-in-the-loop) setup:

  1. A deployment to the “test rack”

A test rack was a rack loaded up with the same hardware profile that we would install in a store and would have Nanos attached to it in order to cheaply replicate a store deployment. This would be our first deployment target. This could be thought of as a “smoke test” - as in, if you turn it on and it starts to smoke then something is broken. We would do some basic validations here but honestly it was mostly a surface-level “does it turn on.”

  2. Deployment to “the store”

The next step was to deploy to a very small convenience store built into the Grabango office itself. It was honestly pretty neat. You could badge in, “buy” some very expired diet coke, and then walk out with your goods. This was a validation that anybody could do and was a fairly concrete “does the entire shopping experience function at a basic level.” And that was a nice confirmation when you have a really complex system involving dozens of components.

  3. Deployment to “defunct client”

This one was kind of amusing. Grabango had a client that essentially… stopped… paying… them money. I believe there was some legal pursuit here to get some sort of financial recompense, but this cost Grabango a whole lot of money. One upside is that it left a couple of production sites available as testing grounds for larger scale production deploys as these would be stores with hundreds of Nanos (or Raspberry Pis).

 ┌────────────┐    ┌────────────┐    ┌──────────────┐    ┌────────────┐
 │ Test rack  │ ─► │  Office    │ ─► │   Defunct-   │ ─► │ Production │
 │  (HIL —    │    │  in-house  │    │  client      │    │   fleet    │
 │ smoke test)│    │  "store"   │    │  stores      │    │ (per-site) │
 └─────┬──────┘    └─────┬──────┘    └──────┬───────┘    └─────┬──────┘
       │                 │                  │                  │
   Jenkins (Nomad    manual shopping    production-scale   manual,
   job on the rack)   validation         dry-run            per-client
                                                            sign-off

                     ◄── every gate is manual ──►
             (environments too bespoke to automate
              promotion with the confidence we had)

Release Promotion

Releases were pretty much always manually gated. We never built up the confidence or the release velocity to justify an automated promotion system. Granted, deployments always went well… but all of the environments were so bespoke that we just couldn’t justify putting in the engineering work for a highly complex release automation system. The test rack wasn’t the test store, which wasn’t the store for the defunct client. And even between clients, stores were pretty different, with scales of deployment ranging from a few dozen Nanos to several hundred.

Releases weren’t signed. We pinned images by digest and did SHA256 level validation but nothing more complex than that.

Jenkins, State, and Telemetry

While there was unit testing for every application written, integration tests would mostly run within the confines of the test racks and the facsimile store. Jenkins would run as an agent within a Nomad job on the test rack itself with the Nanos acting as targets.

The general pipeline steps were:

  1. Wipe state as much as possible. This could be purging Kafka topics, deleting video data, etc.
  2. Validate Kafka events
  3. Validate that video data is being written to the ZFS pool
  4. Validate that health checks are passing
  5. Validate metrics and telemetry are coming back within acceptable boundaries

Testing hardware is, uh, difficult. It’s not like you can just operate off of clean-room mocked data sets - you need to validate against a very large set of variables. For example, what if you don’t test temperature variation for your devices when running under load? What about when running under load with a high ambient temperature? Have you put your devices in a heat chamber, cranked up the temps, and then monitored the outcome? No? Well, what if your edge devices are deployed on the ceiling of a stuffy convenience store in Arizona during the summer? Want to know what happens? They roast. And then you’re paged.

This means that Grabango’s HIL testing harnesses needed to produce a large set of data that fed into Grafana dashboards. What I cannot state enough is that health is NOT binary, health is a GRADIENT, and you know the shape of that gradient through telemetry. As we ran integration tests against hardware profiles, we would measure thermal envelopes, frames captured per second, bounding box creation timings, etc. An increase of a couple of degrees in average temperature could very well translate to an entire store going down because it just got too hot.
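The “acceptable boundaries” checks from the integration runs reduce to comparing telemetry against per-metric ranges. A toy sketch - the metric names and limits here are invented examples, not Grabango’s real thresholds:

```python
def within_bounds(metrics, bounds):
    """Return (ok, violations). bounds maps metric name -> (lo, hi),
    inclusive. Metrics without a declared bound are ignored."""
    violations = {
        name: value
        for name, value in metrics.items()
        if name in bounds and not (bounds[name][0] <= value <= bounds[name][1])
    }
    return not violations, violations

# Invented sample: one Nano's telemetry after a test deploy.
sample = {"soc_temp_c": 81.0, "frames_per_sec": 27.5, "bbox_latency_ms": 140.0}
limits = {"soc_temp_c": (0, 75), "frames_per_sec": (24, 32), "bbox_latency_ms": (0, 250)}
ok, bad = within_bounds(sample, limits)
# soc_temp_c exceeds its 75 C ceiling, so ok is False and bad flags it
```

The interesting part is never the comparison itself; it’s choosing bounds that encode the gradient - a device at 74 C passes, but a fleet trending from 68 to 74 is a dashboard conversation.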

But something I really adored about a hardware-forward role was just how genuinely amusing and novel the engineering problems could be. A big benefit of cloud computing is just how homogeneous your environment can be and how straightforward the failure modes are. Why did your piece of edge compute fail? Well, maybe a delivery driver with a teetering stack of beer crates failed to stick the landing and dumped several pints of Busch Light onto your badge-in kiosk. Maybe the ISP outage wasn’t because of a failure of service, maybe it’s because a mail truck drove into the power lines near the 7-11 your product is deployed into. Or, my personal favorite, maybe the employee who had to work in the same space as the server… just turned it off because it was too loud (valid).

In my interview, I was asked: “if you just got a call from a customer that a piece of hardware had failed on site, what are your next steps?”

“I’ll ask them if it’s plugged in.”

Developer Tooling and Dagger

Part of the job as a devops/platform engineer/sre/release engineer/WHATEVER is to build tooling for other developers. In the context of Grabango, this meant I needed to build something that:

  1. Create a build environment that bridges CI/CD and a developer’s local environment
  2. Provide a unified way to upload packages to Artifactory
  3. Provide a homogeneous environment to run unit and integration tests for Rust, Python, and Node.js/TypeScript
  4. Perform ad-hoc Nomad deploys to test environments
  5. Wrap metric and telemetry query operations for test environments
  6. Fetch data from different areas to paint holistic data flows

This led me to a tool called Dagger. I still geek out about it. I think it’s easiest to describe it as a “pluggable execution engine” for deploying software. What initially drove me to use it at Grabango was that I wanted a way to write automation code in Go that supported Grabango’s library of in-use languages.

Let’s provide context.

I have a firmware developer whose responsibilities are many within the company. They constantly use the test racks and other similar hardware constructs to run greenfield experiments or execute custom workflows. Let’s say that for a greenfield project they need to build a Rust application, upload its binary to Artifactory, generate a Nomad job file with some basic defaults, deploy that Nomad job to one of the test environments, monitor its deployment and subsequent telemetry data, and then clean up that deployment. This process could involve several tools, API calls, downloads, uploads, version generation, etc. Dagger allowed me to create a single execution point that wrapped every - containerized - operation in a subcommand.
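The real tool was built on Dagger’s SDK, which I’m not reproducing here; but the “single execution point” shape is easy to sketch with a plain subcommand dispatcher. The tool name, subcommands, and bodies below are invented stand-ins for what were containerized Dagger operations:

```python
import argparse

def build(args):
    # Stand-in: the real step ran the build inside a container via Dagger.
    return f"built {args.target} for {args.arch}"

def upload(args):
    # Stand-in for pushing an artifact to Artifactory.
    return f"uploaded {args.artifact} to artifactory"

def deploy(args):
    # Stand-in for rendering a Nomad job and submitting it to a test env.
    return f"deployed nomad job to {args.env}"

def make_parser():
    parser = argparse.ArgumentParser(prog="fleet")  # invented tool name
    sub = parser.add_subparsers(dest="command", required=True)

    p = sub.add_parser("build")
    p.add_argument("target")
    p.add_argument("--arch", default="arm64")
    p.set_defaults(func=build)

    p = sub.add_parser("upload")
    p.add_argument("artifact")
    p.set_defaults(func=upload)

    p = sub.add_parser("deploy")
    p.add_argument("--env", default="test-rack")
    p.set_defaults(func=deploy)
    return parser

def run(argv):
    args = make_parser().parse_args(argv)
    return args.func(args)
```

One binary, one mental model: build, upload, deploy, observe, clean up - the same entrypoint whether the caller is a developer’s laptop or a CI job.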

Dagger also allowed us to wrap every lint, test, and build step into a containerized environment that was easily portable to our CI/CD system. The testing and release pipeline was THE SAME on Jenkins as it was on a developer’s local laptop.

And I didn’t have to write any Groovy DSL or Jenkinsfiles to implement any of this. Dagger supports a lot of different programming languages through SDKs for its engine. You can express any CI process using whatever language you’re comfortable with. muah Chef’s kiss.

What I wanted to get to

If I had more time…

Distributed CI Caching

Dagger has the ability to leverage a distributed build cache. This makes it so that every BuildKit operation references a shared cache - across local developer builds, CI builds, and anywhere else BuildKit runs. This would have been a pretty big time-saver. Though, honestly, Grabango just wasn’t at the scale where this would have been useful, so it was never prioritized - it was more of a curiosity for myself.

Advanced Deployment Strategies

As stated, Nomad rolling deployments used a query to collect an aggregated failure rate and applied that to a rolling window to determine whether or not to halt a rollout. But deployments were still serial. This was mostly because we couldn’t deploy at random; if multiple cameras were rolled at once, we’d lose the redundant vision coverage of a shelf. What I wanted to do was figure out a way to build a deployment topology where Nomad would be aware of which cameras were adjacent to others and roll out updates in waves. Or, at the very least, just deploy to one half of the fleet at a time where vision was redundant (even then there were edge cases like cameras above entrances). It would have been a fun problem to work through.
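The wave idea reduces to graph coloring: treat cameras with overlapping shelf coverage as adjacent and never put two adjacent cameras in the same wave. A sketch of the greedy version - the adjacency data here is invented; the real topology would have come from the store’s camera map:

```python
def partition_into_waves(adjacency):
    """Greedily assign each camera the lowest wave number not used by an
    already-assigned neighbor. Adjacent cameras (overlapping coverage)
    never land in the same wave, so shelf coverage stays redundant while
    a wave is being updated.

    adjacency: dict mapping camera id -> iterable of neighboring ids.
    Returns dict mapping camera id -> wave number (0-based).
    """
    waves = {}
    for cam in sorted(adjacency):  # deterministic order for repeatable deploys
        taken = {waves[n] for n in adjacency[cam] if n in waves}
        wave = 0
        while wave in taken:
            wave += 1
        waves[cam] = wave
    return waves

# Four cameras in a row: a-b-c-d. Alternating waves suffice.
rails = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
waves = partition_into_waves(rails)
```

Each wave can then roll in parallel internally, turning an O(fleet) serial deploy into O(number of waves) - while the entrance-camera edge cases would just mean a few singleton waves.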

Clustered Nomad at Deployment Sites

Nomad ran in single-server mode on the servers themselves. Something that I would have liked was to include 3 NUCs in our hardware profile so that I could run a real cluster. However, capex budgets never really allowed for such frivolity so nothing could be done about it.

Ditch Jenkins in General

The Jenkinsfiles / Groovy we maintained were mostly replaced by Dagger. It would have been nice to finish that work and deprecate the service.

Closing

I miss Grabango. Mostly because I miss my team. I think it’s pretty rare to find a work environment you feel comfortable in and want to spend years and years being a part of. Most people migrate after a couple of years but I think I would have taken the long road with Grabango if it had remained solvent - just on the back of how wonderful it was to share that time with the people I was privileged enough to work alongside.

howdoicomputer's blog



2026-04-16