Data Science Weekly Newsletter

Issue 379

February 25, 2021

Editor's Picks

Byte Down: Making Netflix’s Data Infrastructure Cost-Effective
Our efficiency approach, therefore, is to provide cost transparency and place the efficiency context as close to the decision-makers as possible. Our highest leverage tool is a custom dashboard that serves as a feedback loop to data producers and consumers — it is the single holistic source of truth for cost and usage trends for Netflix’s data users. This post details our approach and lessons learned in creating our data efficiency dashboard...

Building a $5,000 Machine Learning Workstation
NVIDIA was kind enough to provide my YouTube channel with a TITAN RTX. This video shows the process that I went through to plan and build a computer based on a TITAN RTX and a 24-core, 3.8 GHz Ryzen Threadripper CPU...

Time to Build Robots for Humans, Not to Replace
Thinking about the future of robots and autonomy is exciting; driverless cars, lights-out factories, urban air mobility, robotic surgeons available anywhere in the world. We’ve seen the building blocks come together in warehouses, retail stores, farms, and on the roads. It is now time to build robots for humans, not to replace them...

A Message From This Week's Sponsor

Data scientists are in demand on Vettery

Vettery is an online hiring marketplace that's changing the way people hire and get hired. Ready for a bold career move? Make a free profile, name your salary, and connect with hiring managers from top employers today.

Data Science Articles & Videos

Snorkel AI: Putting Data First in ML Development
Today I’m excited to announce Snorkel AI’s launch out of stealth! Snorkel AI, which spun out of the Stanford AI Lab in 2019, was founded on two simple premises: first, that the labeled training data machine learning models learn from is increasingly what determines the success or failure of AI applications. And second, that we can do much better than labeling this data entirely by hand....

Large scale experimentation
The experiments most companies run today are based on classical statistical techniques, in particular null hypothesis statistical testing. There, the focus is on analyzing a single experiment that is sufficiently powered. However, these techniques ignore one crucial aspect that is prevalent in many contemporary settings: we have many experiments to run, which introduces an opportunity cost. Every time we assign an observation to one experiment, we lose the opportunity to assign it to another. We [Stitch Fix] propose a new setting where we want to find “winning interventions” as quickly as possible in terms of samples used...

How we choose what to research
At Riskified, we have an extremely capable data science and research department, composed of over 30 data scientists and analysts. As a fast-growing startup we take a practical approach, measuring ourselves based on the actual business value delivered every quarter and how effectively we interpret data. This means that the majority of our quarterly planned work is driven by internal research initiatives with the minority coming from business stakeholders in other departments. One of the biggest challenges in any quarterly plan is identifying and deciding on the most promising research directions to undertake. While we can’t say that we’ve got it all figured out, this blog will try to shed some light on the tradeoffs we consider and how we tackle this problem...

Productionizing machine learning models, one thoughtful change at a time
Josh Tobin, former researcher at OpenAI and creator of Full Stack Deep Learning, talks about professionalizing ML workflows for the real world, his work with the Robotics team, and FSDL...

TaBERT: A new model for understanding queries over tabular data
TaBERT is the first model that has been pretrained to learn representations for both natural language sentences and tabular data. These sorts of representations are useful for natural language understanding tasks that involve joint reasoning over natural language sentences and tables. A representative example is semantic parsing over databases, where a natural language question (e.g., “Which country has the highest GDP?”) is mapped to a program executable over database (DB) tables...

Introducing cAInvas for TinyML devices
Machine learning and deep learning models have been changing markets, products and lifestyles one vertical at a time. Today, bringing ML models to embedded systems can be as complex as stitching together 8 steps in a flow...

Are Capsules a Good Idea? A Generative Perspective
I’ve recently written a paper on a fully probabilistic version of capsule networks. While trying to get this kind of model to work, I found some interesting conceptual issues with the ideas underlying capsule networks. Some of these issues are a bit philosophical in nature and I haven’t thought of a good way to pin them down in an ML conference paper. But I think they could inform research when we design new probabilistic vision models (and they are very interesting), so I’ve tried to give some insight into them here...

Terminology matters
Clashing terminology arises whenever researchers don’t read outside their own discipline area. There are lots of other examples of clashes between statistics and machine learning, and between statistics and econometrics...

Meta-Learning Requires Meta-Augmentation
Simply augmenting the data often yields bigger performance gains than tweaking the model. We formalize "meta-augmentation" and show that you can apply it to pretty much any meta-learning problem and any meta-learner...

Training

Quick Question For You: Do you want a Data Science job?

After helping hundreds of readers like you get Data Science jobs, we've distilled all the real-world-tested advice into a self-directed course.
The course is broken down into three guides:

Data Science Getting Started Guide. This guide shows you how to figure out the knowledge gaps that MUST be closed in order for you to become a data scientist quickly and effectively (as well as the ones you can ignore)

Data Science Project Portfolio Guide. This guide teaches you how to start, structure, and develop your data science portfolio with the right goals and direction so that you are a hiring manager's dream candidate

Data Science Resume Guide. This guide shows how to make your resume promote your best parts, what to leave out, how to tailor it to each job you want, as well as how to make your cover letter so good it can't be ignored!

Click here to learn more
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Senior Data Scientist - Grubhub - NY / Chicago

Grubhub is looking for a data scientist to join the Pricing team. As a part of Pricing, you’ll be a member of a small team of data scientists and engineers who shape and optimize how we charge our diners, influencing hundreds of millions in revenue annually. You will work closely with both financial stakeholders and engineers to ship models that make Grubhub more efficient in the way it charges customers. You’ll construct models and A/B tests as well as write code to improve our modeling capabilities...

Want to post a job here? Email us for details >> team@datascienceweekly.org

Training & Resources

Explaining RNNs without neural networks
There are lots of great articles, books, and videos that describe the functionality, mathematics, and behavior of RNNs so, don't worry, this isn't yet another rehash. (See below for a list of resources.) My goal is to present an explanation that avoids the neural network metaphor, stripping it down to its essence—a series of vector transformations that result in embeddings for variable-length input vectors...

PyTorch Implementation of Differentiable SDE Solvers
This codebase provides stochastic differential equation (SDE) solvers with GPU support and efficient sensitivity analysis. Similar to torchdiffeq, algorithms in this repository are fully supported to run on GPUs...

TensorFlow, Keras and deep learning, without a PhD
Clear, approachable, fun introduction to neural networks...

Books

Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement
"A book that tries to cover multiple databases is a risky endeavor; a book that also provides hands-on work on each is even riskier, but if implemented well it leads to a great package. I loved the specific exercises the authors covered. A must-read for all big data architects who don’t shy away from coding..."

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages, check out our resources page.

P.S. Enjoy the newsletter? Please forward it to your friends and colleagues. We'd love to have them onboard :) All the best, Hannah & Sebastian
