bskycharts post-mortem

Jotting down some notes before I forget. A lot of false starts over the past few days, and they're starting to run together. Also, I know some of you like this behind-the-scenes stuff.

It all started because I wanted to run some feedgens and bskycharts off of a single subscribeRepos firehose connection. But I didn't want one big megascript that did everything, so the idea was: read the firehose once and republish the events over redis pub/sub, so that any subscriber can listen.
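
The shape of that fan-out, sketched in miniature. A plain dict of callbacks stands in for redis pub/sub here; the real script would call something like redis_client.publish(channel, payload), and every name below is illustrative, not what my code actually used:

```python
# One firehose reader publishes; any number of listeners consume.
# A dict of callbacks stands in for redis pub/sub in this sketch.
from collections import defaultdict
from typing import Callable

subscribers: dict[str, list[Callable[[bytes], None]]] = defaultdict(list)

def subscribe(channel: str, handler: Callable[[bytes], None]) -> None:
    subscribers[channel].append(handler)

def publish(channel: str, payload: bytes) -> None:
    # redis-style fan-out: every subscriber on the channel gets a copy
    for handler in subscribers[channel]:
        handler(payload)

received: list[bytes] = []
subscribe("app.bsky.feed.post", received.append)
publish("app.bsky.feed.post", b'{"text": "hello"}')
publish("app.bsky.feed.like", b"{}")  # no subscribers; message is dropped
```

The point of the indirection is that the firehose gets read exactly once, and adding a new consumer is just another subscribe call.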

Wednesday PM: the rebroadcast script is done. Each message is published into a channel named after the record's type, so a listener can selectively subscribe to just what it wants. bskycharts was cut over, ran for a bit, everything looked good, went to bed.
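
Channel-per-type in one function: in commit events, each op's record path starts with the collection (e.g. app.bsky.feed.post), so the channel name falls out of a split. A sketch, with a made-up function name:

```python
# Derive the pub/sub channel from a commit op's record path. On the
# firehose, an op path looks like "app.bsky.feed.post/<rkey>", where
# the part before the slash is the record collection (the "type").
def channel_for(op_path: str) -> str:
    collection, _, _rkey = op_path.partition("/")
    return collection

# A listener that only cares about posts subscribes to just that channel.
assert channel_for("app.bsky.feed.post/3kabcdefg") == "app.bsky.feed.post"
assert channel_for("app.bsky.graph.follow/3kxyz") == "app.bsky.graph.follow"
```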

Thursday AM: Woke up to a broken charts website. It wasn't handling tooBig events properly. Catch-up was slow (realized I was doing decode->encode->decode), but it eventually caught up, ran the rest of the day, and seemed fine. I thought it was smooth sailing from there.
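
The decode->encode->decode mistake, in miniature. The real frames are DAG-CBOR and the shape here is simplified (json stands in for CBOR), but the point carries over: a rebroadcaster that only routes messages can decode just enough to pick the channel and forward the original bytes, instead of fully decoding, re-serializing, and making the subscriber decode all over again:

```python
import json

raw_frame = b'{"type": "app.bsky.feed.post", "text": "hi"}'

def rebroadcast_slow(raw: bytes) -> bytes:
    # What I was doing: fully decode the frame, re-encode it to publish,
    # and then the subscriber decodes it all over again.
    event = json.loads(raw)
    return json.dumps(event).encode()

def route_fast(raw: bytes) -> tuple[str, bytes]:
    # Decode only what routing needs (the type, for the channel name)
    # and forward the original bytes untouched.
    channel = json.loads(raw)["type"]
    return channel, raw
```

Same information reaches the subscriber either way; the fast path just skips one full encode and one full decode per event, which matters a lot during catch-up.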

Friday AM: Woke up again to a broken charts website. This time something crashed, and the subsequent catch-up took down the whole server. Logs had OOM kills, load average was over 20, it was a mess. Decided to stop everything and reset.

Friday AM/PM: First decided to centralize the script locations. A new script would rebroadcast everything, and a utility function would handle parsing commit events. Did a test run; it crashed the server. I was publishing faster than anything could consume. Back to the drawing board.
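
That failure mode is baked in: redis pub/sub has no backpressure, so a publisher that outruns its consumers just grows a backlog until something gets OOM-killed. One way to cap it locally is a bounded buffer between the firehose reader and the publisher; this is a sketch with made-up names, not what my scripts actually did (a server-side alternative would be Redis Streams with XADD MAXLEN, which trims the backlog in redis itself):

```python
from collections import deque

class BoundedBuffer:
    """Bounded buffer between the firehose reader and the publisher.

    When the consumer falls behind, the oldest events are dropped (and
    counted) instead of letting memory grow without bound.
    """

    def __init__(self, maxlen: int) -> None:
        self.events: deque[bytes] = deque(maxlen=maxlen)
        self.dropped = 0

    def push(self, event: bytes) -> None:
        if len(self.events) == self.events.maxlen:
            self.dropped += 1  # deque evicts the oldest entry on append
        self.events.append(event)

buf = BoundedBuffer(maxlen=3)
for i in range(5):
    buf.push(f"event-{i}".encode())
```

Dropping events is only acceptable for some consumers, of course; for anything that must not miss events, the honest fix is to slow the producer down or persist the backlog.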

Friday PM: Decided to see if websocat could work as the rebroadcaster. Complex docs, but it seems possible. Did a test run locally using the firehose cursor from Friday AM, but the parser crashes after a few thousand events. Still not sure what's up. Went to bed feeling defeated and annoyed.

Edit 3/31: websocat's buffer size was too small; increasing it with websocat -B 1048576 seemed to do the trick.

Saturday AM: Decided I didn't want to lose accurate user counts, so I went back and hacked up bsky-activity to read directly from the firehose using the Friday AM cursor. Restarted the process; it seems to be working. First sense of relief, though also annoyance at being right back to square one.

Saturday mid-AM: Realized after 3-ish hours that DAU was going to be inflated. It was using the current timestamp when it should have been using the repo event's timestamp. Never really an issue during live processing, but it is during a big catch-up. Right now it's processing the backlog, about 8.5 hours behind.
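
The DAU bug fits in one function: bucketing by wall-clock time looks fine while processing live, but during a catch-up every backlogged event lands in today's bucket. Bucketing by the event's own timestamp handles both cases. A sketch, assuming the event carries an ISO 8601 time string (repo events do have a time field, though the exact shape here is simplified):

```python
from datetime import datetime, timezone

def dau_bucket(event_time: str) -> str:
    # Use the repo event's own timestamp, not datetime.now(): during a
    # multi-hour catch-up, "now" would dump every backlogged event into
    # today's bucket and inflate DAU.
    ts = datetime.fromisoformat(event_time)
    return ts.astimezone(timezone.utc).date().isoformat()

# An event from Friday counts toward Friday even if replayed on Saturday.
assert dau_bucket("2024-03-29T23:30:00+00:00") == "2024-03-29"
```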

Biggest thing I would do differently is set up a staging environment. Doing this all on production was pretty dumb. Learned a lot about redis, though. And part of me enjoyed the high-stakes debugging on Thursday. It was the Friday crash where I started getting annoyed at myself.

Original thread:

Created: 2024-03-30
Last Updated: 2024-03-31
