Recurse SP2'23 #4: A Familiar Gopher
Small highlights from today:
- I got to collaborate with some folks on a fun scraping and data analysis project this morning. It was nice to write some Go, and I have some thoughts below!
- I got coffee with a local Recurser, Stevie, to discuss audio programming, Zig, and more.
- I implemented some really slow Gaussian blurring as a follow-up to yesterday’s reading. I talked to my friend about speeding this up, but I want to work that implementation out before writing about it here. Soon!
A big question I had for Stevie was “how do different processes pass audio stream data between one another?”
For example, how does audio capture work if I’m sharing my screen over Zoom while watching a video? What about audio applications - doesn’t JACK have a mixer component? I’m guessing there’s a shared buffer somewhere, but who owns it? I’d love to learn more on this.
Getting back to Go: a scraper
Nikki has a fun scraping and data analysis project that Tom and I have joined in on. I’ll save the specifics of the project for when it’s gotten further along, and focus here on the fun I had coding this morning.
I haven’t gotten to write much Go in the last year, so it was nice to get back into it. Scrapers are also rewarding since they involve so many fun tasks: web requests, HTML parsing (okay, not all fun), worker pooling and communication, data wrangling and modeling, etc.
This was also a good opportunity to try some Go packages I wasn’t familiar with.
I used chromedp as a headless browser, since the pages would only load data via JavaScript.
I found chromedp’s API a little tough to work with, perhaps because I’m not familiar with the DevTools protocol.
I handed the resulting data off to goquery, which provided a more friendly, jQuery-like interface. This made it easy to extract the data I wanted.
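To give a feel for how the two packages fit together, here’s a minimal sketch of the handoff; the URL and the h1 selector are placeholders, not the actual project’s targets:

```go
package main

import (
	"context"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"github.com/chromedp/chromedp"
)

func main() {
	// chromedp drives a headless Chrome instance over the DevTools protocol.
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	// Navigate, let the page render (including JS-loaded content),
	// and capture the final HTML.
	var html string
	if err := chromedp.Run(ctx,
		chromedp.Navigate("https://example.com"),
		chromedp.OuterHTML("html", &html),
	); err != nil {
		log.Fatal(err)
	}

	// Hand the rendered HTML to goquery for jQuery-style extraction.
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		log.Fatal(err)
	}
	doc.Find("h1").Each(func(_ int, s *goquery.Selection) {
		log.Println(s.Text())
	})
}
```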
The code is a bit longer than I’d like to just dump into the blog, so I’ve thrown it in a Gist for anyone interested. In the interest of time and not knocking over their server, we’ll probably take a different approach (asking or FOIA’ing) to getting the data. But this was still a fun exercise for the morning!
Gopher appreciation
I think my favorite part of this process was being reminded of the things I like about Go. The language is simple enough to remember, and it has great support for the things I find interesting. Even after a long break, it’s so easy to pick it back up and get a useful, concurrent system working in no time. Most of the concurrency involved was possible with just channels and no imports! It was still a good use case for sync.WaitGroup, which only added a few lines of code:
"sync" // Import
var wg sync.WaitGroup // Instantiate the WaitGroup
wg.Add(1) // Count each worker
w := &Worker{w_id, jobChan, recordChan, &wg} // Register WaitGroup with each worker
wg.Wait() // Wait in main loop for jobs to complete
wg *sync.WaitGroup // From Worker struct definition
defer w.wg.Done() // Ensure worker decrements waitgroup count when finished
If you haven’t seen sync.WaitGroup before, it’s very handy! It lets your main thread kick off a lot of goroutines, increment the wait group’s count for each, and then block on wg.Wait() until the count is decremented back to zero, which happens once every worker goroutine has called wg.Done(). That’s it!
You can see in the code how I’m using it to wait until each worker has finished its work and passed all records to the record handler. When the workers are all done, I tell the record handler that there are no more records to receive, then wait for its signal before actually exiting the program. This is a similar (but simpler) synchronization that avoids interrupting any writes the record handler might still be completing. A minimal sketch of the whole pattern follows.
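Here’s a compact, runnable sketch of that shutdown dance, with toy Job and Record types standing in for the real ones:

```go
package main

import (
	"fmt"
	"sync"
)

type Job int
type Record string

func worker(id int, jobs <-chan Job, records chan<- Record, wg *sync.WaitGroup) {
	defer wg.Done() // decrement the count when this worker exits
	for j := range jobs {
		records <- Record(fmt.Sprintf("worker %d handled job %d", id, j))
	}
}

func main() {
	jobs := make(chan Job)
	records := make(chan Record)
	handlerDone := make(chan struct{})

	// Record handler: drain records until the channel closes,
	// then signal main that all writes have completed.
	go func() {
		for r := range records {
			fmt.Println(r)
		}
		close(handlerDone)
	}()

	var wg sync.WaitGroup
	for id := 0; id < 3; id++ {
		wg.Add(1) // count each worker
		go worker(id, jobs, records, &wg)
	}

	for j := 0; j < 9; j++ {
		jobs <- Job(j)
	}
	close(jobs) // no more work for the pool

	wg.Wait()      // block until every worker has called Done
	close(records) // tell the record handler there's nothing more coming
	<-handlerDone  // wait for its signal before exiting
}
```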
Reflections on the code
Here are some things I’d like to improve or learn about:
- I should be more polite and set a user agent with contact info. Probably easy to configure in chromedp (see the first sketch after this list).
- I definitely need to handle job failure and site unavailability (the second sketch after this list shows one approach):
- Push failed jobs onto a re-do queue
- Enable backoff/slowdown if enough jobs are failing
- I still need to write out to CSV or a DB when done debugging (a third sketch below covers the CSV side). I might return to this to get some practice with data modeling in PostgreSQL.
- I suspect I could have made everything work with just chromedp, but goquery’s interface felt much more accessible for quickly getting data from the elements I was interested in.
- I might also try a different headless browser like webloop.
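On the politeness point: if I remember chromedp’s options correctly, the user agent can be set when creating the allocator. A sketch (the contact string is made up):

```go
// Hypothetical allocator setup with a descriptive user agent.
opts := append(chromedp.DefaultExecAllocatorOptions[:],
	chromedp.UserAgent("my-scraper/0.1 (contact: me@example.com)"),
)
allocCtx, cancelAlloc := chromedp.NewExecAllocator(context.Background(), opts...)
defer cancelAlloc()

ctx, cancel := chromedp.NewContext(allocCtx)
defer cancel()
```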
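For failure handling, even something as simple as retrying inside the worker with exponential backoff would help; fetch and maxAttempts here are hypothetical stand-ins for the real scraping call and config:

```go
// fetchWithRetry wraps a flaky fetch with exponential backoff: 1s, 2s, 4s, ...
func fetchWithRetry(url string, maxAttempts int) (string, error) {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		var html string
		if html, err = fetch(url); err == nil { // fetch is a hypothetical helper
			return html, nil
		}
		time.Sleep(time.Second * time.Duration(1<<attempt))
	}
	return "", fmt.Errorf("giving up on %s after %d attempts: %w", url, maxAttempts, err)
}
```

A proper re-do queue would push failed jobs back onto a channel instead, but the idea is the same.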
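And for the CSV output, the standard library’s encoding/csv should cover it; the field names are placeholders for whatever the record type ends up holding:

```go
// Hypothetical CSV writer for the collected records.
f, err := os.Create("records.csv")
if err != nil {
	log.Fatal(err)
}
defer f.Close()

w := csv.NewWriter(f)
defer w.Flush() // flush buffered rows before closing

w.Write([]string{"id", "name", "value"}) // header row
for _, rec := range records {
	w.Write([]string{rec.ID, rec.Name, rec.Value})
}
```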