Mechanisms for Effective Machine Learning Projects




  • If your team is like most teams I’ve been on, you have 2-3 problems for every available person. Thus, each member works on 1 or 2 problems simultaneously, with some folks taking 3 or more. And because everyone’s so busy, we barely have time to check in on each other’s projects outside of standup, planning, retrospective, etc.
  • One solution is to have a pilot and copilot for each project. The pilot is the main project owner and is in charge of its success (or failure). They own and delegate the work as required though they’re usually responsible for the bulk of design and critical code paths.
  • The copilot helps the pilot stay on track, identify critical flaws, and call out blind spots. This includes periodic check-ins, reviewing document drafts and prototypes, and being a mandatory code reviewer. For example, the copilot should challenge the pilot if the proposed design doesn’t solve the business problem, or if the train-validation split is invalid. To be able to spot these issues, the copilot typically has experience in the problem space, or has more experience in general, similar to how senior engineers guide juniors.
  • Pilots and copilots don’t have to be from the same job family. As an applied scientist, I often partner with an engineer who helps with infrastructure, observability, CI/CD, etc. If both scientist and engineer are sufficiently experienced, they can double up as each other’s copilot. As they review each other’s work, knowledge transfer occurs organically and they learn to be effective copilots for other engineers or scientists in future projects.
  • For a literature review, I read papers relevant to the problem. I’m biased towards solutions that have been applied in industry, though more academic papers have also been helpful.
  • How the problem was framed: If it’s fraud detection, is it framed as a classification problem (fraud vs. no fraud) or regression problem (quantum of fraud)? Or is it framed as an anomaly detection problem to be solved via unsupervised learning?
  • How input data was processed: How was data excluded, preprocessed, and rebalanced? How were labels defined? Was a third neutral class added? How were labels augmented, perhaps via hard mining?
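One common rebalancing choice from the list above is downsampling the majority class. A minimal sketch, assuming the data is a list of labeled records; the function name, `label_key`, and `ratio` parameters are illustrative, not from any specific library:

```python
import random

def downsample_majority(rows, label_key="label", positive=1, ratio=1.0, seed=42):
    """Downsample the majority class so negatives = ratio * positives.

    rows: list of dicts, each with a label under `label_key`.
    Deterministic given `seed`, so experiments stay reproducible.
    """
    positives = [r for r in rows if r[label_key] == positive]
    negatives = [r for r in rows if r[label_key] != positive]
    k = min(len(negatives), int(len(positives) * ratio))
    rng = random.Random(seed)
    return positives + rng.sample(negatives, k)
```

Whether to rebalance at all (vs. reweighting the loss) is itself a question worth answering from the literature review.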
  • How the model was evaluated offline: How was the training and validation set created? What offline evaluation metrics did they use? How did they improve the correlation between offline and online evaluation metrics?
  • Input data and features: Am I using data that would not be available during inference? For example, if I’m predicting hospitalization costs during pre-admission, am I peeking into the future and using features such as length of hospitalization? If so, it’s a data leak as we won’t know the length of stay in advance and it’s highly correlated with hospitalization cost.
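One cheap guard against this kind of leak is to maintain an allowlist of features known to exist at inference time and fail loudly if training uses anything else. A minimal sketch; the feature names and the `INFERENCE_TIME_FEATURES` set are hypothetical stand-ins for the hospitalization example:

```python
# Hypothetical allowlist: features available at pre-admission time.
INFERENCE_TIME_FEATURES = {"age", "diagnosis_code", "admission_type"}

def check_no_leakage(training_features):
    """Raise if any training feature isn't available at inference time."""
    leaked = set(training_features) - INFERENCE_TIME_FEATURES
    if leaked:
        raise ValueError(
            f"Potential data leak, features unavailable at inference: {sorted(leaked)}"
        )

check_no_leakage(["age", "diagnosis_code"])  # passes
# check_no_leakage(["age", "length_of_stay"])  # would raise: known only post-discharge
```

Running this check in the training pipeline catches leaks like `length_of_stay` before a suspiciously good offline metric does.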
  • Offline evaluation: If we’re building a forecast model, are we splitting data by time or just randomly? The former ensures we don’t learn on future data, while the latter leaks future data into training and is invalid. (Most production use cases should have data split by time.)
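A time-based split is a few lines of code: sort by timestamp and cut, so everything in validation is strictly later than training. A minimal sketch, assuming records carry a timestamp under a hypothetical `ts` key:

```python
def time_split(rows, timestamp_key="ts", train_frac=0.8):
    """Chronological split: earliest train_frac of rows for training,
    the remainder (all later in time) for validation."""
    ordered = sorted(rows, key=lambda r: r[timestamp_key])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]
```

Unlike a random split, every validation row here postdates every training row, which mirrors how the model will actually be used in production.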
  • Timeboxing for machine learning projects can be challenging because, compared to engineering projects, the work is relatively ill-defined. Furthermore, a large part of the work is research and experimentation, which unfortunately leads to many a dead end.
  • But it’s because of these challenges that timeboxing is effective: how much effort should we invest before pivoting? In most industry settings, we don’t have limitless resources to pursue a problem for years.