Linear Digressions

Katie Malone
Jan 5, 2020 • 25min

Data scientists: beware of simple metrics

Picking a metric for a problem means defining how you’ll measure success in solving that problem. Which sounds important, because it is, but oftentimes new data scientists only get experience with a few kinds of metrics when they’re learning and those metrics have real shortcomings when you think about what they tell you, or don’t, about how well you’re really solving the underlying problem. This episode takes a step back and says, what are some metrics that are popular with data scientists, why are they popular, and what are their shortcomings when it comes to the real world? There’s been a lot of great thinking and writing recently on this topic, and we cover a lot of that discussion along with some perspective of our own. Relevant links: https://www.fast.ai/2019/09/24/metrics/ https://arxiv.org/abs/1909.12475 https://medium.com/shoprunner/evaluating-classification-models-1-ff0730801f17 https://hbr.org/2019/09/dont-let-metrics-undermine-your-business
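As a quick numeric illustration of the kind of shortcoming discussed in this episode (our toy example, not one from the episode or the linked posts): on an imbalanced problem, a model that learns nothing can still score well on accuracy.

```python
# A classifier that always predicts the majority class looks
# deceptively good under accuracy when classes are imbalanced.
y_true = [0] * 95 + [1] * 5   # 95% negative, 5% positive
y_pred = [0] * 100            # model never predicts the positive class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(accuracy)  # 0.95 -- looks strong
print(recall)    # 0.0  -- but it finds zero positive cases
```

This is exactly why a single popular metric, evaluated in isolation, can paper over a model that fails at the underlying problem.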
Dec 30, 2019 • 26min

Communicating data science, from academia to industry

For something as multifaceted and ill-defined as data science, communication and sharing best practices across the field can be extremely valuable but also extremely, well, multifaceted and ill-defined. That doesn’t bother our guest today, Prof. Xiao-Li Meng of the Harvard statistics department, who is leading an effort to start an open-access Data Science Review journal in the model of the Harvard Business Review or Law Review. This episode features Xiao-Li talking about the need he sees for a central gathering place for data scientists in academia, industry, and government to come together to learn from (and teach!) each other. Relevant links: https://hdsr.mitpress.mit.edu/
Dec 23, 2019 • 19min

Optimizing for the short-term vs. the long-term

When data scientists run experiments, like A/B tests, it’s really easy to plan on a period of a few days to a few weeks for collecting data. The thing is, the change that’s being evaluated might have effects that last a lot longer than a few days or a few weeks—having a big sale might increase sales this week, but doing that repeatedly will teach customers to wait until there’s a sale and never buy anything at full price, which could ultimately drive down revenue in the long term. Increasing the volume of ads on a website might lead people to click on more ads in the short term, but in the long term they’ll be more likely to visually block the ads out and learn to ignore them. But these long-term effects aren’t apparent from the short-term experiment, so this week we’re talking about a paper from Google research that confronts the short-term vs. long-term tradeoff, and how to measure long-term effects from short-term experiments. Relevant links: https://research.google/pubs/pub43887/
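To make the tradeoff concrete, here's a toy simulation (entirely made-up numbers, not from the Google paper): a change with an immediate lift that decays as users adapt, plus a small ongoing cost, can look like a clear win over a two-week test window yet be net negative over a year.

```python
# Toy model: immediate lift that decays as users adapt,
# plus a small steady long-run cost (e.g. ad blindness).
# All numbers are invented for illustration.
baseline_lift = 20.0   # extra revenue per day right after the change
decay = 0.97           # users adapt: lift shrinks 3% per day
daily_cost = 2.0       # ongoing long-run cost per day

def daily_effect(day):
    return baseline_lift * decay**day - daily_cost

two_week_effect = sum(daily_effect(d) for d in range(14))
one_year_effect = sum(daily_effect(d) for d in range(365))

print(two_week_effect > 0)  # the short experiment calls it a win
print(one_year_effect < 0)  # but the long-run total is a loss
```

A two-week experiment only ever sees the left edge of that curve, which is the core measurement problem the paper tackles.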
Dec 16, 2019 • 28min

Interview with Prof. Andrew Lo, on using data science to inform complex business decisions

This episode features Prof. Andrew Lo, the author of a paper that we discussed recently on Linear Digressions, in which Prof. Lo uses data to predict whether a medicine in the development pipeline will eventually go on to win FDA approval. This episode gets into the story behind that paper: how the approval prospects of different drugs inform the investment decisions of pharma companies, how to stitch together siloed and incomplete datasets to form a coherent picture, and how the academics building some of these models think about when and how their work can make it out of academia and into industry. Professor Lo is an expert in business (he teaches at the MIT Sloan School of Management) and work like his shows how data science can open up new ways of doing business. Relevant links: https://hdsr.mitpress.mit.edu/pub/ct67j043
Dec 8, 2019 • 25min

Using machine learning to predict drug approvals

One of the hottest areas in data science and machine learning right now is healthcare: the size of the healthcare industry, the amount of data it generates, and the myriad improvements possible in the healthcare system lay the groundwork for compelling, innovative new data initiatives. One spot that drives much of the cost of medicine is the riskiness of developing new drugs: drug trials can cost hundreds of millions of dollars to run and, especially given that numerous medicines end up failing to get approval from the FDA, pharmaceutical companies want to have as much insight as possible about whether a drug is more or less likely to make it through clinical trials and on to approval. Professor Andrew Lo and collaborators at the MIT Sloan School of Management are taking a look at this prediction task using machine learning, and have an article in the Harvard Data Science Review showing what they were able to find. It’s a fascinating example of how data science can be used to address business needs in creative but very targeted and effective ways. Relevant links: https://hdsr.mitpress.mit.edu/pub/ct67j043
Dec 2, 2019 • 43min

Facial recognition, society, and the law

Not too long ago, facial recognition in everyday life seemed far off. Now it’s being deployed and advanced widely and quickly, which means that our technical capabilities are starting to outpace (if they haven’t already) our consensus as a society about what is acceptable in facial recognition and what isn’t. The threats to privacy, fairness, and freedom are real, and Microsoft has become one of the first large companies using this technology to speak out in specific support of its regulation through legislation. Their arguments are interesting and provocative, and even if you don’t agree with every point they make, or harbor some skepticism, there’s a lot to think about in what they’re saying.
Nov 25, 2019 • 28min

Lessons learned from doing data science, at scale, in industry

If you’ve taken a machine learning class, or read up on A/B tests, you likely have a decent grounding in the theoretical pillars of data science. But if you’re in a position to have actually built lots of models or run lots of experiments, there’s almost certainly a bunch of extra “street smarts” insights you’ve had that go beyond the “book smarts” of more academic studies. The data scientists at Booking.com, who build models and run experiments constantly, have written a paper that bridges the gap and talks about what non-obvious things they’ve learned from that practice. In this episode we read and digest that paper, talking through the gotchas that they don’t always teach in a classroom but that make data science tricky and interesting in the real world. Relevant links: https://www.kdd.org/kdd2019/accepted-papers/view/150-successful-machine-learning-models-6-lessons-learned-at-booking.com
Nov 18, 2019 • 36min

Varsity A/B Testing

When you want to understand if doing something causes something else to happen, like if a change to a website causes a dip or rise in downstream conversions, the gold standard analysis method is to use randomized controlled trials. Once you’ve properly randomized the treatment and effect, the analysis methods are well-understood and there are great tools in R and Python (and other languages) to find the effects. However, when you’re operating at scale, the logistics of running all those tests, and reaching correct conclusions reliably, becomes the main challenge—making sure the right metrics are being computed, knowing when to stop an experiment, minimizing the chances of finding spurious results, and many other issues that are simple to track for one or two experiments but become real challenges for dozens or hundreds of experiments. Nonetheless, the reality is that there might be dozens or hundreds of experiments worth running. So in this episode, we’ll work through some of the most important issues for running experiments at scale, with strong support from a series of great blog posts from Airbnb about how they solve this very issue. For some blog post links relevant to this episode, visit lineardigressions.com
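One of the scale problems mentioned here, spurious results, can be illustrated with a quick simulation (ours, not from the Airbnb posts): run enough experiments where nothing is actually happening, and at a 5% significance threshold roughly 5% of them will still come back as "wins" by chance. A multiple-testing correction like Bonferroni (one standard remedy, shown here as an example) shrinks the threshold as the experiment count grows.

```python
import random

random.seed(0)
alpha, n_experiments = 0.05, 200

# Simulate A/A "experiments" with no true effect: under the null
# hypothesis, each experiment's p-value is uniform on [0, 1].
p_values = [random.random() for _ in range(n_experiments)]

naive_hits = sum(p < alpha for p in p_values)
bonferroni_hits = sum(p < alpha / n_experiments for p in p_values)

print(naive_hits)       # roughly alpha * n_experiments spurious "wins"
print(bonferroni_hits)  # far fewer after correcting for multiple tests
```

At one or two experiments this barely matters; at hundreds, uncorrected thresholds guarantee a steady stream of false discoveries, which is exactly the kind of logistics-of-scale issue the episode digs into.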
Nov 11, 2019 • 25min

The Care and Feeding of Data Scientists: Growing Careers

This is the third and final installment of a conversation with Michelangelo D’Agostino, VP of Data Science and Engineering at Shoprunner, about growing and mentoring data scientists on your team. Some of our topics of conversation include how to institute hack time as a way to learn new things, what career growth looks like in data science, and how to institutionalize professional growth as part of a career ladder. As with the other episodes in this series, the topics we cover today are also covered in the O’Reilly report linked below. Relevant links: https://oreilly-ds-report.s3.amazonaws.com/Care_and_Feeding_of_Data_Scientists.pdf
Nov 4, 2019 • 20min

The Care and Feeding of Data Scientists: Recruiting and Hiring Data Scientists

This week’s episode is the second in a three-part interview series with Michelangelo D’Agostino, VP of Data Science at Shoprunner. This discussion centers on building a team, which means recruiting, interviewing and hiring data scientists. Since data science talent is in such high demand, and data scientists are understandably choosy about where they go to work, a good recruiting and hiring program can have a big impact on the size and quality of the team. Our chat covers a couple of sections in our co-authored O’Reilly report, “The Care and Feeding of Data Scientists,” which you can read at the link below. https://oreilly-ds-report.s3.amazonaws.com/Care_and_Feeding_of_Data_Scientists.pdf
