How To Think About What Data Science Projects To Include In Portfolio


You are looking for personal data science project ideas to build into your portfolio. You don't have much professional experience to point to, so you want to make sure you choose the right things to include. The only issue is that there are just so many different things you can do.

It's important to you to figure out what data science projects to include in your portfolio because you have realized that having a data science portfolio will make it easier to get interviews, easier to perform well during those interviews, and ultimately easier to get that data science job that you want. You know that hiring managers will look favorably on your skills because they can see that you can do the technical work, as well as communicate and explain your work and how it fits into the bigger picture of a data science job.

Data Science Projects - so many choices, so little time

There are just so many different projects that you can do. You could write well-documented scripts in R or Python or Scala or another statistical programming language. You could show how you clean and process a dirty data set. You could create a machine learning pipeline with MapReduce or Spark to create recommendations. You could use Python to create an ETL service for social media accounts. You could make data visualizations of different types of categorical data, numerical data, nominal data, or some type of combination. You could choose one of several different machine learning algorithms and explain why this particular model is good for this particular data set. You could analyze the computational complexity of different machine learning algorithms. You could even analyze the storage complexity of the algorithms whose computational complexity you just analyzed.

And that doesn't even cover the higher-level question of whether it's more important to show familiarity with tools, good programming practices, your understanding of mathematical ideas, your understanding of statistics, or a combination of all of these.

You do not have an infinite amount of time and resources to construct all of these and many more data science projects.

Data Science Projects - Lines, not Dots

When creating and putting together a data science portfolio, you want to make sure you are communicating three things with your work.

  • Your knowledge of the full data science process
  • Your technical knowledge
  • Your business-y knowledge

Each data science project you choose to include in your portfolio should have some part of it demonstrate your skills in at least one of the three categories. The important thing to realize is that you don't have to show all three in every single project. You can show various parts of your knowledge in different projects. By doing this you are going to be able to show lines, rather than dots. That is, though no single project will show off all of your skills, the combination of all of the projects will.
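One way to make "lines, not dots" concrete is to write down which categories each project covers and check for gaps. The sketch below is only an illustration: the project names and the category labels ("process", "technical", "business") are made up for this example, and you would swap in your own.

```python
# Hypothetical portfolio: each project maps to the categories it demonstrates.
# The names and labels are illustrative assumptions, not a required scheme.
CATEGORIES = {"process", "technical", "business"}

portfolio = {
    "bike-share demand forecast": {"process", "technical"},
    "churn dashboard write-up": {"business", "process"},
    "web scraper for listings": {"technical"},
}


def coverage(projects):
    """Return, for each category, the projects that demonstrate it."""
    covered = {category: [] for category in sorted(CATEGORIES)}
    for name, skills in projects.items():
        for skill in skills & CATEGORIES:
            covered[skill].append(name)
    return covered


def missing(projects):
    """Return the categories that no project demonstrates yet."""
    return sorted(c for c, names in coverage(projects).items() if not names)


if __name__ == "__main__":
    for category, names in coverage(portfolio).items():
        print(f"{category}: {len(names)} project(s)")
    print("gaps:", missing(portfolio) or "none")
```

No single project here covers all three categories, yet together they leave no gap, which is the point of the exercise.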

Choose Data Science Projects which show your conceptual knowledge of the data science process

The data science process, according to the "Introduction to Data Science" class that Professors Joe Blitzstein and Hanspeter Pfister teach at Harvard, entails the following:

  1. Ask an interesting question
  2. Get the data
  3. Explore the data
  4. Model the data
  5. Communicate and Visualize the results

So when you go to choose a data science project, remembering "lines, not dots", you want to make sure that you cover at least one of the above steps.

If your project was going to cover the "Ask an interesting question" step, you'd want to demonstrate your thinking behind the questions "What is the scientific goal?", "What would you do if you had all the data?", and "What do you want to predict or estimate?".

If your project was going to cover the "Get the data" step, you'd want to demonstrate your thinking behind the questions "How were the data sampled?", "Which data are relevant?", "Are there privacy issues?".

If your project was going to cover the "Explore the data" step, you'd want to demonstrate your thinking behind the questions "How do I best plot the data?", "Are there anomalies?", "Are there patterns?".
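As a small illustration of answering "Are there anomalies?", here is a minimal sketch that flags outliers with the interquartile range rule. The `daily_orders` numbers are invented for the example, and the 1.5 multiplier is a common convention rather than a rule your data must follow.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Made-up data: nine ordinary days and one suspicious spike.
daily_orders = [52, 48, 50, 47, 53, 49, 51, 50, 46, 240]

if __name__ == "__main__":
    print("possible anomalies:", iqr_outliers(daily_orders))
```

In a portfolio write-up, the interesting part is what you do next: explain whether the flagged value is a data entry error, a real event, or something you cannot determine from the data alone.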

If your project was going to cover the "Model the data" step, you'd want to demonstrate your thinking behind the questions "Which model should I choose?", "How do I build the model I chose?", "How do I fit the model?", "How do I validate the model?".
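For the "How do I fit the model?" and "How do I validate the model?" questions, a minimal sketch might look like the following. It assumes synthetic data generated inside the script, a plain least-squares line as the model, and a single held-out test split compared against a predict-the-mean baseline. A real project would use its own data and probably a library such as scikit-learn.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Average absolute difference between predictions and observed values."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)) / len(xs)


# Synthetic data (an assumption for illustration): y = 3x + 2 plus noise.
rng = random.Random(0)
xs = [float(i) for i in range(40)]
ys = [3.0 * x + 2.0 + rng.gauss(0, 1.0) for x in xs]

# Hold out 10 of the 40 points so the model is scored on data it never saw.
indices = list(range(len(xs)))
rng.shuffle(indices)
train_idx, test_idx = indices[:30], indices[30:]
train_x = [xs[i] for i in train_idx]
train_y = [ys[i] for i in train_idx]
test_x = [xs[i] for i in test_idx]
test_y = [ys[i] for i in test_idx]

model = fit_line(train_x, train_y)
# Baseline: always predict the training mean (slope 0).
baseline = (0.0, sum(train_y) / len(train_y))

model_error = mean_abs_error(model, test_x, test_y)
baseline_error = mean_abs_error(baseline, test_x, test_y)

if __name__ == "__main__":
    print(f"held-out error, fitted line: {model_error:.2f}")
    print(f"held-out error, baseline:    {baseline_error:.2f}")
```

Comparing against a simple baseline on held-out data is one way to show a reader that you thought about validation, not just fitting.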

If your project was going to cover the "Communicate and visualize the results" step, you'd want to demonstrate your thinking behind the questions "What did we learn?", "Do the results make sense?", "How do I know the results make sense?", "Can we tell a story with the results and if so, what story will we tell?".

By describing your thinking behind some of these questions as part of your data science project, you'll be able to showcase your knowledge of the data science process.

Choose Data Science Projects which demonstrate your technical knowledge

Once you have an idea of how you are going to communicate your conceptual understanding of the data science process, it's worth thinking about what aspects of your technical knowledge you want to showcase. Note that here, "technical" covers both your programming and hacking skills as well as your math and statistics knowledge. Looking at Blitzstein and Pfister's data science process again:

  1. Ask an interesting question
  2. Get the data
  3. Explore the data
  4. Model the data
  5. Communicate and Visualize the results

When choosing a data science project, you want to choose a project that allows you to show your technical ability for different parts of the process.

If your project was going to cover the "Ask an interesting question" step, you'd want to demonstrate your technical thinking and choices behind the questions "What is the scientific goal?", "What would you do if you had all the data?", and "What do you want to predict or estimate?".

If your project was going to cover the "Get the data" step, you'd want to demonstrate your technical thinking, choices, and implementation behind the questions "How were the data sampled?", "Which data are relevant?", "Are there privacy issues?".

If your project was going to cover the "Explore the data" step, you'd want to demonstrate your technical thinking, choices, and implementation behind the questions "How do I best plot the data?", "Are there anomalies?", "Are there patterns?".

If your project was going to cover the "Model the data" step, you'd want to demonstrate your technical thinking, choices, and implementation behind the questions "Which model should I choose?", "How do I build the model I chose?", "How do I fit the model?", "How do I validate the model?".

If your project was going to cover the "Communicate and visualize the results" step, you'd want to demonstrate your technical thinking, choices, and implementation behind the questions "What did we learn?", "Do the results make sense?", "How do I know the results make sense?", "Can we tell a story with the results and if so, what story will we tell?".

By describing your technical thinking, choices, and implementation behind some of these questions as part of your data science project, you'll be able to showcase your technical knowledge.

Choose Data Science Projects which demonstrate your business-y knowledge

A very important part of data science is the business-y side of the work that is done. In Drew Conway's Data Science Venn Diagram, "Substantive Expertise" ties together hacking skills with math and statistics knowledge to produce great data science work. So once you have an idea of how you are going to communicate your conceptual understanding of the data science process as well as how you're going to do the technical implementation, it's worth thinking about what aspects of your business-y knowledge you want to showcase. Looking again at Blitzstein and Pfister's data science process:

  1. Ask an interesting question
  2. Get the data
  3. Explore the data
  4. Model the data
  5. Communicate and Visualize the results

When choosing a data science project, you want to choose a project that allows you to show your understanding and awareness of the business aspects and drivers of the different parts of the data science process.

If your project was going to cover the "Ask an interesting question" step, you'd want to demonstrate your business-y knowledge and why it matters to the business behind the questions "What is the scientific goal?", "What would you do if you had all the data?", and "What do you want to predict or estimate?".

If your project was going to cover the "Get the data" step, you'd want to demonstrate your business-y knowledge and why it matters to the business behind the questions "How were the data sampled?", "Which data are relevant?", "Are there privacy issues?".

If your project was going to cover the "Explore the data" step, you'd want to demonstrate your business-y knowledge and why it matters to the business behind the questions "How do I best plot the data?", "Are there anomalies?", "Are there patterns?".

If your project was going to cover the "Model the data" step, you'd want to demonstrate your business-y knowledge and why it matters to the business behind the questions "Which model should I choose?", "How do I build the model I chose?", "How do I fit the model?", "How do I validate the model?".

If your project was going to cover the "Communicate and visualize the results" step, you'd want to demonstrate your business-y knowledge and why it matters to the business behind the questions "What did we learn?", "Do the results make sense?", "How do I know the results make sense?", "Can we tell a story with the results and if so, what story will we tell?".

By describing your business-y knowledge and why it matters to the business behind some of these questions as part of your data science project, you'll be able to showcase your awareness of how your data science project can and will fit within a larger organization.

Data Science Projects To Include In Your Portfolio Should, In Aggregate, Show All Of Your Skills

You do not have an infinite amount of time and resources to construct data science project after data science project that showcases all of your process understanding, technical skills, and business-y knowledge, especially if you try to demonstrate all of your skills with every single project. It's just too difficult and time-consuming, and you want to have more than one project in your portfolio.

What you want to do when creating and putting together your data science portfolio is to make sure that, in aggregate, the combination of projects shows your knowledge of the data science process, your technical abilities, and your business-y knowledge. The important thing is to realize that you don't have to show everything in every single project. You can show various parts of your knowledge in different projects. This way, the combination of all of the projects will show you to be a very well-rounded job candidate.

So the next time you think about doing a data science project to include in your data science portfolio, think first about how it will fit into the larger theme and knowledge your portfolio is demonstrating. This way you can do smaller, better-defined projects that fit in very well with your overall profile and message.

Good luck and start thinking about what part of your knowledge and skill set your next data science project will show!
