Parsing “Pollinator Post”

May Chen runs a treasure of an e-mail newsletter, the “Pollinator Post”, where she shares macro shots of insects and plants, mostly local natives, from her hikes around the East Bay, along with commentary drawing on her decades of experience as a docent. Until recently, these posts were available only via an invite-only e-mail group, but now they are available for everyone here: https://bringingbackthenatives.net/pollinator-posts, part of Kathy Kramer’s Bringing Back the Natives Garden Tour site.

I’m happy to have played a part in making this happen. A little over a year ago, I reached out to Kathy offering to volunteer, and she suggested working on this project. I’ve worked on it in fits and starts since then, urged on occasionally by her.

Here are the bits I developed for this:

  • A Chrome extension to do a bulk download of the content from a Google Group: gg_download. Each post gets stored as an MHTML file in your local Downloads folder. (I also extended Chrome to support MHTML downloads in AppleScript. I was inspired to do this when one version of the extension failed to download a few posts, and I wanted a way to script picking those up.)
    • Originally, I had intended to fetch the images and text separately, but ran into browser security restrictions. (I forget the details now.) chrome.pageCapture provided a nice workaround, but at the cost of making the extension Chrome-only for now.
    • An extension, especially one unlikely (?) to be approved for the Chrome Web Store, has the disadvantage of requiring side loading (via chrome://extensions); it’s also not well suited for automated processes (though you could probably hack together something to get that working).
    • On the other hand, I liked being able to work within a normal browser, logged in using a normal flow; I felt like it might be less likely to lead to my account being banned, which was one concern always lurking in my mind.
    • If I need to return to this again, I’ll probably look at doing it in Python. I’m getting some ideas now flipping through my library’s copy of Web Scraping with Python.
  • gg_mhtml_to_site processes the MHTML files into an index plus extracted HTML and images. I used this project as an opportunity to play with Rust. (I’m still getting a feel for it.) One thing that impressed me was the ecosystem of packages available, such as lol_html, and Rust’s tooling for incorporating them.
  • gg_posts_json_site provides JavaScript and CSS for displaying the files generated by gg_mhtml_to_site. It includes support for pagination. We didn’t end up using this for the final version of the site, but it was useful for demo purposes.
  • Finally, gg_posts_csv_flat_images takes the output from gg_mhtml_to_site and converts it into a format more convenient for ingesting the post data into WordPress. This could also have been accomplished by modifying gg_mhtml_to_site directly. For me at least, though, it was quicker to hack on this in Python, and it feels more like a “one-off” than the others, so I preferred to keep it separate.
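
For anyone curious what “processing the MHTML files” involves: an MHTML file is a MIME multipart message, with the page’s HTML in one part and each image in its own part, labeled by a Content-Location header. Here is a minimal sketch of pulling those apart using Python’s standard email module. To be clear, this is not how gg_mhtml_to_site works (that one is written in Rust); the function name split_mhtml and the UTF-8 fallback are my own assumptions for illustration.

```python
import email
from email import policy


def split_mhtml(raw: bytes):
    """Split MHTML bytes into (html_text, {content_location: image_bytes}).

    Hypothetical helper for illustration only; not part of gg_mhtml_to_site.
    """
    msg = email.message_from_bytes(raw, policy=policy.default)
    html = None
    images = {}
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        ctype = part.get_content_type()
        if ctype == "text/html" and html is None:
            # Assumption: fall back to UTF-8 when the part declares no charset.
            charset = part.get_content_charset() or "utf-8"
            html = payload.decode(charset, errors="replace")
        elif ctype.startswith("image/"):
            location = str(part.get("Content-Location", ""))
            images[location] = payload
    return html, images
```

You would call it with something like split_mhtml(open("post.mhtml", "rb").read()), then write the HTML and image bytes wherever you like. The image references in the HTML still point at the original URLs, so a real tool also needs to rewrite them to the extracted files.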

Jeffrey Samorano did the final work integrating the posts into WordPress and making them functional and pretty. It was a pleasure working with him on getting the data ready for this.

I’m grateful to Kathy and Jeffrey for the opportunity to have worked on this. And we all owe May Chen much gratitude for all the beauty and wonder she’s shared with the world through her photos and words.

So, be sure to take a look at the new site for May Chen’s Pollinator Post and check out Bringing Back the Natives Garden Tour.

