

Data Science at Home
Francesco Gadaleta
Cutting through AI bullsh*t. Come join the discussion on Discord! https://discord.gg/4UNKGf3
Episodes

May 20, 2020 • 22min
Compressing deep learning models: distillation (Ep.104)
Running large deep learning models on limited hardware or edge devices is often prohibitive. There are methods that compress large models by orders of magnitude while maintaining similar accuracy during inference.
In this episode I explain one of the first such methods: knowledge distillation.
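To make the idea concrete, here is a minimal sketch of the loss described in the first reference below: a small "student" model is trained to match the temperature-softened outputs of a large "teacher", in addition to the true label. The function names and the default values for `temperature` and `alpha` are illustrative assumptions, not something taken from the episode.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target loss (teacher) and a hard-label loss.

    temperature and alpha are hypothetical defaults chosen for illustration.
    """
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # The T^2 factor keeps the soft-target gradients on a comparable scale.
    soft_loss = cross_entropy(soft_teacher, soft_student) * temperature ** 2
    hard_target = [1.0 if i == true_label else 0.0
                   for i in range(len(student_logits))]
    hard_loss = cross_entropy(hard_target, softmax(student_logits))
    return alpha * soft_loss + (1 - alpha) * hard_loss

teacher = [5.0, 1.0, 0.0]
print(distillation_loss([5.0, 1.0, 0.0], teacher, true_label=0))  # student agrees
print(distillation_loss([0.0, 1.0, 5.0], teacher, true_label=0))  # student disagrees
```

A student that agrees with the teacher gets the lower loss of the two calls above.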
Come join us on Slack
References
Distilling the Knowledge in a Neural Network https://arxiv.org/abs/1503.02531
Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks https://arxiv.org/abs/2004.05937

May 8, 2020 • 20min
Pandemics and the risks of collecting data (Ep. 103)
Covid-19 is an emergency. True. Let's just not set ourselves up for a second emergency, about privacy violations, when this one is over.
Join our new Slack channel
This episode is supported by Proton. You can check them out at protonmail.com or protonvpn.com

Apr 19, 2020 • 15min
Why average can get your predictions very wrong (ep. 102)
Whenever people reason about the probability of events, they tend to consider the average value between two extremes.
In this episode I explain, with a numerical example, why this way of approximating is wrong and dangerous.
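The numbers below are not the example from the episode; they are a made-up illustration of the same pitfall. A 60 km trip is driven at either 30 km/h or 60 km/h with equal probability. Plugging the average speed into the formula gives a different answer than averaging the two actual outcomes.

```python
def travel_time_hours(distance_km, speed_kmh):
    """Time needed to cover a distance at a constant speed."""
    return distance_km / speed_kmh

distance_km = 60.0
speeds_kmh = [30.0, 60.0]  # two equally likely scenarios (made-up numbers)

# Shortcut: average the inputs first, then predict.
average_speed = sum(speeds_kmh) / len(speeds_kmh)
time_at_average_speed = travel_time_hours(distance_km, average_speed)

# Correct expectation: predict for each scenario, then average the outcomes.
average_time = sum(travel_time_hours(distance_km, s) for s in speeds_kmh) / len(speeds_kmh)

print(f"time at the average speed: {time_at_average_speed:.2f} h")
print(f"average of the two times:  {average_time:.2f} h")
```

The shortcut underestimates the expected travel time because travel time is not a linear function of speed.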
We are moving our community to Slack. See you there!

Apr 1, 2020 • 22min
Activate deep learning neurons faster with Dynamic RELU (ep. 101)
In this episode I briefly explain the concept behind activation functions in deep learning. One of the most widely used activation functions is the rectified linear unit (ReLU).
While there are several flavors of ReLU in the literature, in this episode I speak about a very interesting approach that keeps computational complexity low while improving performance quite consistently.
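As a rough sketch, ReLU is the maximum of two fixed linear pieces, and the Dynamic ReLU paper referenced below generalizes this to a maximum over linear pieces whose slopes and intercepts depend on the input. In the sketch the coefficients are simply passed in by hand; the network that produces them in the paper is not reproduced here, and the function names are assumptions made for illustration.

```python
def relu(x):
    """Standard rectified linear unit."""
    return max(0.0, x)

def piecewise_linear_max(x, slopes, intercepts):
    """Maximum over several linear pieces a * x + b.

    With slopes (1, 0) and intercepts (0, 0) this is exactly ReLU.
    In a dynamic variant the coefficients would be computed from the
    input by a small auxiliary network; here they are supplied by hand.
    """
    return max(a * x + b for a, b in zip(slopes, intercepts))

for x in (-2.0, 0.5, 3.0):
    static = piecewise_linear_max(x, (1.0, 0.0), (0.0, 0.0))  # same as relu(x)
    leaky = piecewise_linear_max(x, (1.0, 0.1), (0.0, 0.0))   # leaky-ReLU-like
    print(x, relu(x), static, leaky)
```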
This episode is supported by pryml.io. At pryml we let companies share confidential data. Visit our website.
Don't forget to join us on our Discord channel to propose new episodes or discuss previous ones.
References
Dynamic ReLU https://arxiv.org/abs/2003.10027

Mar 23, 2020 • 24min
WARNING!! Neural networks can memorize secrets (ep. 100)
One of the best features of neural networks and machine learning models is that they memorize patterns from training data and apply them to unseen observations. That's where the magic is.
However, there are scenarios in which the same models learn patterns so well that they can disclose some of the data they have been trained on. This phenomenon goes under the name of unintended memorization, and it is extremely dangerous.
Think about a language generator that discloses the passwords, the credit card numbers and the social security numbers of the records it has been trained on. Or more generally, think about a synthetic data generator that can disclose the training data it is trying to protect.
In this episode I explain why unintended memorization is a real problem in machine learning. Other than differentially private training, there is no way to mitigate this problem in realistic conditions.
At Pryml we are very aware of this, which is why we have been developing a synthetic data generation technology that is not affected by this issue.
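One way to quantify unintended memorization, following the paper referenced below, is the exposure of a planted secret (a "canary"): rank the canary among all candidate secrets by how likely the model finds each one, then compute log2(number of candidates) minus log2(rank). The sketch assumes the per-candidate scores have already been obtained from the model; the scores and candidate strings are made up.

```python
import math

def exposure(candidate_scores, canary):
    """Exposure of a canary given model scores for every candidate secret.

    candidate_scores maps each candidate to a log-perplexity-like score,
    where lower means the model finds the candidate more likely.
    """
    ranked = sorted(candidate_scores, key=candidate_scores.get)
    rank = ranked.index(canary) + 1  # rank 1 = most likely candidate
    return math.log2(len(candidate_scores)) - math.log2(rank)

# Hypothetical scores for four candidate secrets.
scores = {"1111": 2.0, "2222": 5.0, "3333": 7.0, "4444": 9.0}
print(exposure(scores, "1111"))  # most likely candidate: highest exposure
print(exposure(scores, "4444"))  # least likely candidate: zero exposure
```

A canary with high exposure is one the model ranks suspiciously close to the top, which suggests it was memorized.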
This episode is supported by Harmonizely.
Harmonizely lets you build your own unique scheduling page based on your availability so you can start scheduling meetings in just a couple minutes.
Get started by connecting your online calendar and configuring your meeting preferences.
Then, start sharing your scheduling page with your invitees!
References
The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks
https://www.usenix.org/conference/usenixsecurity19/presentation/carlini

Mar 14, 2020 • 20min
Attacks to machine learning model: inferring ownership of training data (Ep. 99)
In this episode I explain a very effective technique for inferring whether any record at hand belongs to the (private) dataset used to train the target model. The technique is effective because it works on black-box models, where the attacker has no access to the training data, the model parameters, or the hyperparameters. Such a scenario is very realistic and typical of machine-learning-as-a-service APIs.
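The attack in the referenced paper trains shadow models to learn what the target's outputs look like for members versus non-members. The sketch below is not that attack. It is a much simpler confidence-threshold baseline that only illustrates the underlying signal: models tend to be more confident on records they were trained on. The threshold value and the prediction vectors are made up.

```python
def infer_membership(prediction_vectors, threshold=0.9):
    """Guess 'member' when the model's top class probability reaches the threshold.

    prediction_vectors holds the class-probability outputs returned by a
    black-box model, one list per queried record. The threshold is a
    hypothetical value that an attacker would have to calibrate.
    """
    return [max(probs) >= threshold for probs in prediction_vectors]

# Hypothetical outputs of a black-box API for two records.
outputs = [
    [0.98, 0.01, 0.01],  # very confident: guessed to be a training member
    [0.40, 0.35, 0.25],  # uncertain: guessed to be a non-member
]
print(infer_membership(outputs))
```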
This episode is supported by pryml.io, a platform I am personally working on that enables data sharing without giving up confidentiality.
As promised, below is the schema of the attack explained in the episode.
References
Membership Inference Attacks Against Machine Learning Models

Mar 8, 2020 • 14min
Don't be naive with data anonymization (Ep. 98)
Masking, obfuscating, stripping, shuffling.
All of the above techniques try to do one simple thing: keep the data private while sharing it with third parties. Unfortunately, they are not a silver bullet for confidentiality.
All the players in the synthetic data space rely on simplistic techniques that are not secure, might not be compliant, and are risky for production.
At pryml we do things differently.

Mar 1, 2020 • 11min
Why sharing real data is dangerous (Ep. 97)
There are very good reasons why a financial institution should never share their data. Actually, they should never even move their data. Ever.
In this episode I explain why.

Feb 22, 2020 • 14min
Building reproducible machine learning in production (Ep. 96)
Building reproducible models is essential for all those scenarios in which the lead developer is collaborating with other team members. Reproducibility in machine learning shall not be an art, rather it should be achieved via a methodical approach.
In this episode I give a few suggestions on how to make your ML models reproducible and keep your workflow smooth.
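The show notes do not list the suggestions themselves, so the sketch below is an assumption about one common ingredient of reproducibility: routing every source of randomness through an explicit seed. `train_toy_model` is a hypothetical stand-in for a real training run.

```python
import random

def train_toy_model(seed):
    """Stand-in for a training run whose randomness is fully seed-controlled."""
    rng = random.Random(seed)  # local generator: no hidden global state
    weights = [rng.gauss(0.0, 1.0) for _ in range(3)]  # "random initialization"
    data_order = list(range(10))
    rng.shuffle(data_order)  # "random data shuffling"
    return weights, data_order

# Two runs with the same seed produce identical results.
print(train_toy_model(42) == train_toy_model(42))
```

Recording the seed alongside the code version and data version lets a teammate rerun the same experiment.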
Enjoy the show!
Come visit us on our Discord channel and have a chat.

Feb 14, 2020 • 13min
Bridging the gap between data science and data engineering: metrics (Ep. 95)
Data science and data engineering are usually two different departments in organisations. Bridging the gap between the two is essential to success. Many times the brilliant applications created by data scientists don't find a match in production, just because they are not production-ready.
In this episode I have a talk with Daan Gerits, co-founder and CTO at Pryml.io.


