Short review: Building Machine Learning Powered Applications

I enjoyed Emmanuel Ameisen’s book Building Machine Learning Powered Applications. (Companion GitHub repo.) I believe I had encountered most of its ideas before through work and elsewhere, but it was still helpful to see them laid out all together. (It makes me thankful again to the coworkers who shared helpful ideas with me.) I also appreciated the interviews with practitioners in the field.

Several of the test set splitting pitfalls he mentions match my own experience:

  • Avoid including future events in the training data.
  • Avoid splitting duplicate examples between test and training.
  • Avoid splitting data for the same individual between test and training.
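For my own reference, here is a minimal sketch of how I might guard against these pitfalls. It is not taken from the book; the function names, the row format (a list of dicts), and the hash-bucket approach are all my own assumptions.

```python
import hashlib


def in_test_set(group_id: str, test_fraction: float = 0.2) -> bool:
    """Deterministically map a group id to train or test via a hash bucket."""
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return bucket < test_fraction


def split_by_group(rows, group_key, test_fraction=0.2):
    """Keep every row sharing a group key (e.g. one individual) on one side."""
    train, test = [], []
    for row in rows:
        if in_test_set(str(row[group_key]), test_fraction):
            test.append(row)
        else:
            train.append(row)
    return train, test


def split_by_time(rows, time_key, cutoff):
    """Train only on rows strictly before the cutoff; test on the rest."""
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test
```

Using the example's own content as the group key should also keep exact duplicates together, since identical strings hash to the same bucket. Note that the hash-based split only approximates the requested test fraction, especially when there are few groups.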

Maybe it’s too obvious to need mentioning (I don’t think Ameisen mentions it), but one silly mistake I made in a personal project was accidentally splitting augmented data derived from the same underlying examples between test and training, which I suppose is a variant of the second pitfall.
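A rough sketch of the fix, as I understand it now: split the underlying examples first, and only then augment. The `augment` function here is a hypothetical stand-in (it just adds a little noise to a number), and the dict fields are made up for illustration.

```python
import random


def augment(example, n_copies, rng):
    """Hypothetical augmentation: jitter the value, remembering the source id."""
    return [
        {"source_id": example["id"], "value": example["value"] + rng.gauss(0, 0.01)}
        for _ in range(n_copies)
    ]


def split_then_augment(examples, test_fraction=0.2, n_copies=3, seed=0):
    """Split the base examples first, then augment only the training side."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test_base, train_base = shuffled[:n_test], shuffled[n_test:]
    train = [aug for ex in train_base for aug in augment(ex, n_copies, rng)]
    # The test set is left unaugmented here; that is a choice, not a requirement.
    test = [{"source_id": ex["id"], "value": ex["value"]} for ex in test_base]
    return train, test
```

Because the split happens before augmentation, no augmented copy in training can share a source example with anything in the test set.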

Something else I’d be curious to see addressed sometime is Frank Harrell’s concern about the use of improper scoring rules, laid out in Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules.
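To make the worry concrete for myself, here is a toy comparison of my own (not from Harrell’s post or the book): two sets of predicted probabilities that get identical accuracy at a 0.5 threshold, but that the Brier score and log loss, both proper scoring rules, tell apart. The data is made up.

```python
import math


def accuracy(y_true, p_pred, threshold=0.5):
    """Fraction correct after thresholding; ignores how confident p is."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)


def brier_score(y_true, p_pred):
    """Mean squared error of the predicted probabilities (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood (lower is better); p is clipped for safety."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


y = [1, 1, 0, 0]
confident = [0.9, 0.8, 0.1, 0.2]     # made-up, well-separated predictions
hesitant = [0.55, 0.51, 0.49, 0.45]  # made-up, barely separated predictions

for name, p in [("confident", confident), ("hesitant", hesitant)]:
    print(name, accuracy(y, p), brier_score(y, p), log_loss(y, p))
```

Both prediction sets classify every example correctly, so accuracy cannot distinguish them, while the two proper scores both favor the confident set.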

One thing I want to keep track of after returning it to the library is the list of datasets:

I also plan to look more into the links he shared on how various companies do online experimentation:

I also want to revisit: The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction.

I’m being a little careful with spending money, but maybe I’ll spring for my own copy of Ameisen’s book.

