Data Science Weekly Newsletter

Issue

421

December 16, 2021

‍

Editor's Picks

‍

Lee Wilkinson’s contribution to interactive visualization
Upon learning this morning that Lee Wilkinson passed away I also felt compelled to write something on the extent to which his work has influenced interactive visualization research...The Grammar of Graphics was an incredibly ambitious undertaking – Wilkinson set out to create a system that could produce any statistical graphic he’d ever seen, and that could deepen understanding of the meaning of graphics...

Announcing the Transactions on Machine Learning Research
With this post, we’re happy to announce that we (Raia Hadsell, Kyunghyun Cho, Hugo Larochelle) are founding a new journal...the review process will be hosted by OpenReview, and therefore will be open and transparent to the community. Another differentiation from JMLR will be the use of double blind reviewing, the consequence being that the submission of previously published research, even with extension, will not be allowed. Finally, we intend to work hard on establishing a fast-turnaround review process, focusing in particular on shorter-form submissions that are common at machine learning conferences...

The Death of Feature Engineering is Greatly Exaggerated
One of the most exciting aspects of deep learning’s emergence in computer vision a few years ago was that it didn’t appear to require any feature engineering, unlike previous techniques like histograms-of-gradients or Haar cascades. As neural networks ate up other fields like NLP and speech, the hope was that feature engineering would become unnecessary for those domains too. At first I fully bought into this idea, and saw any remaining manually-engineered feature pipelines as legacy code that would soon be subsumed by more advanced models...Over the last few years of working with product teams to deploy models in production I’ve realized I was wrong...

‍

A Message From This Week's Sponsor

‍

Retool is the fast way to build an interface for any database With Retool, you don't need to be a developer to quickly build an app or dashboard on top of any data set. Data teams at companies like NBC use Retool to build any interface on top of their data—whether it's a simple read-write visualization or a full-fledged ML workflow. Drag and drop UI components—like tables and charts—to create apps. At every step, you can jump into the code to define the SQL queries and JavaScript that power how your app acts and connects to data. The result—less time on repetitive work and more time to discover insights.

‍

Data Science Articles & Videos

‍

PyTorch vs TensorFlow in 2022
Should you use PyTorch vs TensorFlow in 2022? This guide walks through the major pros and cons of PyTorch vs TensorFlow, and how you can pick the right framework...

Improving the factual accuracy of language models through web browsing
We’ve fine-tuned GPT-3 to more accurately answer open-ended questions using a text-based web browser. Our prototype copies how humans research answers to questions online – it submits search queries, follows links, and scrolls up and down web pages. It is trained to cite its sources, which makes it easier to give feedback to improve factual accuracy. We’re excited about developing more truthful AI,1 but challenges remain, such as coping with unfamiliar types of questions...

Optimization Nuggets: Exponential Convergence of SGD
This is the first of a series of blog posts on short and beautiful proofs in optimization (let me know what you think in the comments!). For this first post in the series I'll show that stochastic gradient descent (SGD) converges exponentially fast to a neighborhood of the solution...

Data with a Purpose with Moritz Stefaner
Meet Moritz Stefaner, a data designer who uses data for storytelling and who helped design the official German Covid-19 vaccine data dashboard. Moritz tells The Data Wranglers — Jeffrey Heer and Adam Wilson — how he creates a character from a dataset to give it emotional meaning and talks about the Covid vaccine clock he created. And, he dives into his data visualizations for train traffic on a German railroad network, the promises and pitfalls of using machine learning for data design, and what it took to visualize 175 years of text from Scientific American...

Using AI to bring children’s drawings to life
We’re excited to announce a first-of-its-kind method for automatically animating children’s hand-drawn figures of people and humanlike characters (i.e., a character with two arms, two legs, a head, etc.) that bring these drawings to life in a matter of minutes using AI. By uploading them to our prototype system, parents and children can experience the excitement of watching their drawings become moving characters that dance, skip, and jump...

Best of the visualisation web - August 2021
Since 2010 I have compiled and published monthly collections of links to some of the best, most interesting, or thought-provoking data visualisation-related content I come across. These collections are not always published immediately after the month in question has ended, but I try to do so as soon as my workload permits! Here's a collection of some of the best content I encountered during August 2021...

How AI Happens Podcast
How AI Happens is a podcast featuring experts and practitioners explaining their work at the cutting edge of Artificial Intelligence. Tune in to hear AI Researchers, Data Scientists, ML Engineers, and the leaders of today’s most exciting AI companies explain the newest and most challenging facets of their field...

The Science of Visual Data Communication: What Works
Effectively designed data visualizations allow viewers to use their powerful visual systems to understand patterns in data across science, education, health, and public policy. But ineffectively designed visualizations can cause confusion, misunderstanding, or even distrust—especially among viewers with low graphical literacy. We review research-backed guidelines for creating effective and intuitive visualizations oriented toward communicating data to students, coworkers, and the general public...

Modern Experimentation Platforms
Che Sharma is the founder and CEO of Eppo, an experimentation framework that integrates with modern data platforms (cloud lakehouses and cloud data warehouses). We discuss the importance of investing in experimentation tools and the power of having a well-oiled experimentation culture within an organization. Che also explains how modern data platforms enable a variety of applications, including experimentation frameworks like Eppo...

Training Machine Learning Models More Efficiently with Dataset Distillation
For a machine learning (ML) algorithm to be effective, useful features must be extracted from (often) large amounts of training data. However, this process can be made challenging due to the costs associated with training on such large datasets, both in terms of compute requirements and wall clock time. The idea of distillation plays an important role in these situations by reducing the resources required for the model to be effective. The most widely known form of distillation is model distillation (a.k.a. knowledge distillation), where the predictions of large, complex teacher models are distilled into smaller models...An alternative option to this model-space approach is dataset distillation, in which a large dataset is distilled into a synthetic, smaller dataset....

‍

Tools

‍

Free Course: Natural Language Processing (NLP) for Semantic Search Learn how to build semantic search applications by making machines understand language as people do. This free course covers everything you need to build state-of-the-art language models, from machine translation to question-answering, and more. Brought to you by Pinecone. Start reading now. *Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

‍

Jobs

‍

Data Scientist, Decisions - Lyft - New York, NY Data Science is at the heart of Lyft’s products and decision-making. As a member of the Science team, you will work in a dynamic environment, where we embrace moving quickly to build the world’s best transportation. Data Scientists take on a variety of problems ranging from shaping critical business decisions to building algorithms that power our internal and external products. We’re looking for passionate, driven Data Scientists to take on some of the most interesting and impactful problems in ridesharing...

Want to post a job here? Email us for details >> team@datascienceweekly.org

‍

Training & Resources

‍

Python Practice Problems for Beginner Coders
From sifting through Twitter data to making your own Minecraft modifications, Python is one of the most versatile programming languages at a coder’s disposal. The open-source, object-oriented language is also quickly becoming one of the most-used languages in data science...To help readers practice the Python fundamentals, datascience@berkeley gathered six coding problems, including some from the W200: Introduction to Data Science Programming course. The questions below cover concepts ranging from basic data types to object-oriented programming using classes....

Book Draft: Distributional Reinforcement Learning
By considering the return distribution, rather than just the expected return, we gain a fresh perspective on the fundamental problems of reinforcement learning. This includes understanding of how optimal decisions should be made, methods for creating effective representations of an agent’s state, and the consequences of interacting with other learningagents. In fact, many of the tools we develop here are useful beyond reinforcement learning and decision making. We call the process of computing return distributions distributional dynamic programming; it can be applied in any situation where probability distributionsshould be propagated within some dependency structure...

How to Setup TensorFlow on Apple M1 Pro and M1 Max (works for M1 too)
Setup a TensorFlow environment on Apple's M1 chips. We'll get TensorFlow to use the M1 GPU as well as install common data science and machine learning libraries using Conda...

‍

Books

‍

Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page...

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian

‍