7 steps to build a data science project in your spare time
Advice from Google’s chief economist
An interesting data science project in your portfolio can not only validate your skills but also help you find a job. Yet beginner data scientists often struggle to come up with ideas for their next project.
But there is a way to find ideas and to decide whether they are worth your time. Dr Hal R. Varian is the Chief Economist at Google and an emeritus professor at the University of California, Berkeley. Those of you who studied economics may recognize him as the author of economics textbooks. In the mid-nineties, Varian described his method of building economic models in a popular essay.
His advice is also useful for data scientists working on machine learning models. Below I paraphrase his tips and distill the ones most useful for building data science projects.
- Get the idea worth pursuing
- Don’t look for literature too early
- Build your model
- Let yourself make mistakes
- Now search the literature
- Discuss with others
- Know when to stop
“The model of research that I describe is an idealization of reality, much like the economic models that I create.”
Get the idea worth pursuing
“The first step is to get an idea. This is not all that hard to do. The tricky part is to get a good idea.”
The first step is crucial. If you take up an idea that excites you, you will work on it with enthusiasm. To generate exciting concepts worth your time, come up with lots and lots of ideas and throw out those that aren’t good enough.
Ideas may come from many different places. But while most beginners start by seeking inspiration in scientific papers, Dr Varian says they are not a good source of original ideas. True, reading articles about data science improves your craft by familiarizing you with new techniques and insights. But at the same time, your mind gets biased by someone else’s ideas.
“I think that you should look for your ideas outside the academic journals — in newspapers, in magazines, in conversations, and in TV and radio programs.”
So, what are the best places?
- newspapers and magazines
- TV and radio programs
- podcasts
- conversations
- business people around you
- your own life and experiences
When you read content that has nothing to do with data science, you may discover problems or inefficiencies that you can solve with AI/ML. You can also simply draw on your memories and take ideas from your own experiences. Ask yourself: how would you help a person like yourself live better, make better decisions, or spend less money?
“One of my favorite pieces of my own work is the paper I wrote on “A Model of Sales”. I had decided to get a new TV so I followed the ads in the newspaper to get an idea of how much it would cost. (…) Once I developed the model I had a research assistant go through a couple of years’ worth of the Ann Arbor News searching for the prices of color TVs. Much to my delight the general pattern of pricing was similar to that predicted by the model. And, yes, I did manage to get a pretty good deal on the TV I eventually bought.”
You will know when you come up with a good idea. If an idea appeals to your sense of usefulness, it is probably a good one to work on.
- Generate a lot of ideas by reading and listening outside your specialisation.
- Take ideas that appeal to your sense of usefulness.
Is your idea worth pursuing?
The first test of your idea is to explain it in a way that a person unfamiliar with AI/ML can understand. Try to paraphrase your idea so that a six-year-old can understand it. If you can’t, the idea is probably not very good, or you overestimated your level of knowledge. In the latter case, take a step back and read more about the topic.
But even if you can explain the subject to a child without boring them to death, there is still no guarantee that the idea is any good. Although it is tempting, pursuing the first viable idea doesn’t guarantee anything. Ask a few friends what they think about your idea. If your concept has a lot of implications and is interesting to other people, it is worth spending some time on it.
“Always remember that working on this particular idea has an opportunity cost — you could be spending your time working on a different idea.”
- Try to paraphrase the idea
- Ask a few people whether it is interesting
- Could this idea have a lot of implications?
Don’t use Google too soon
OK, maybe those aren’t exactly Dr Varian’s words. He means you shouldn’t look at the literature too soon.
“The first thing that most graduate students do is they rush to the literature to see if someone else had this idea already. However, my advice is to wait a bit before you look at the literature.”
Eventually, you will definitely go to the literature and review how others have approached similar problems. But it is better to work on your idea on your own for a few days or weeks first.
Here is why:
- You need practice in developing models.
- You might come up with a different approach than the one found in the literature. You are much more likely to be original if you develop your own insights.
- Your ideas need time to incubate, to interact with other ideas, and to produce something new and interesting.
So, don’t seek similar approaches in papers too soon. You will be biased by them.
Build your model
Great, now you have an idea!
Armed with an unbiased mind and your data science skills, you are ready to build your model. At this point, you probably know what to do. Most machine learning models look pretty much the same.
The basic structure suggests a plan of attack:
1. Answer a few questions:
a. What’s my input?
b. What should the output look like?
c. What features should I consider?
2. Decide which dataset to work on.
3. Take the simplest example. Then another one, and another one.
4. See what is common to your examples.
5. Start with something very specific and then generalize.
6. Keep it simple, stupid.
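To make the plan concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the toy dataset (screen size and advertised TV price), the choice of a mean baseline followed by a one-feature straight line, and all function names. It is not Dr Varian’s method, just one way to start with the simplest example and generalize one small step at a time.

```python
from statistics import mean

# Hypothetical toy data: (screen size in inches, advertised price in dollars).
# Input: screen size. Output: price. Feature considered: size only.
ads = [(32, 210.0), (40, 275.0), (50, 390.0), (55, 425.0), (65, 540.0)]


def fit_baseline(train):
    """Simplest possible model: always predict the mean training price."""
    avg = mean(price for _, price in train)
    return lambda size: avg


def fit_line(train):
    """One step more general: least-squares line price = slope * size + intercept."""
    xs = [size for size, _ in train]
    ys = [price for _, price in train]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in train) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return lambda size: slope * size + intercept


def mean_abs_error(model, data):
    """Average absolute difference between predicted and actual price."""
    return mean(abs(model(size) - price) for size, price in data)


# Hold out the last ad to check whether the extra complexity pays off.
train, test = ads[:4], ads[4:]
baseline = fit_baseline(train)
line = fit_line(train)
baseline_mae = mean_abs_error(baseline, test)
line_mae = mean_abs_error(line, test)

print(f"baseline error: {baseline_mae:.1f}, line error: {line_mae:.1f}")
```

Starting from the baseline gives you a number to beat, so every later addition to the model has to justify itself.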
Let yourself make mistakes
Maybe I should cite Thomas Edison speaking about his thousand ways to not invent the light bulb. But the Internet doesn’t like him anymore, right? Besides, it would be hella cliché. Still, I can’t stress it enough: let yourself make mistakes.
You will make a lot of mistakes on the way to acquiring new skills.
“This takes a surprisingly long time — there are usually lots of false starts, frustrating diversions, and general fumbling around. But keep at it! If it were easy to do, it would have already been done.”
Now search the literature
When you have something working, you can search the literature for similar problems. You will probably find notebooks on Kaggle or GitHub containing similar models, and they will likely be better done, more accurate, and clearer than yours.
When you stumble upon methods that you didn’t use, ask yourself, “Why didn’t I do that?”
“When you’ve worked on a topic for several months, you tend to lose a lot of perspective. First, you may think something is obvious when it really isn’t. The other possibility is that you may think something is complicated when it is really obvious — you’ve wandered into a forest via a meandering path. So at this point you’ve got to start getting some independent judgment of your work.”
Discuss with others
After researching similar work and adjusting your results, you are ready to go public.
“Talk to your advisor, talk to your fellow students, talk to your wife, husband, girlfriend, boyfriend, neighbour, or pet. . . whoever you can get to listen.”
Use your favourite method to reveal your work to the audience:
- Medium article ;)
- Kaggle notebook
- Group on Slack
- GitHub
- YouTube
If you prefer not to publish your work, you can find a mentor. Sharing your work with others exposes you to their judgements.
“When you’ve worked on a topic for several months — or even several weeks — you tend to lose a lot of perspective . . . literally.”
The useful thing about discussing with others is that you get immediate feedback.
Know when to stop
“You can tell when your work is getting ready for publication by the reactions in the seminars: people stop asking questions.”
Done is better than perfect. Our minds can’t properly express what “perfect” means. If your goal is perfection, you will continue to find errors or ways to improve your project.
However, if your project generates useful output and every attempt at further optimization fails, you should accept that and start seeking new challenges.
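If you log a validation score for every attempt, the rule above can be turned into a simple check. The sketch below is only one possible reading of it: the function name, the `patience` and `min_gain` thresholds, and the example scores are all hypothetical, and it assumes that a higher score is better.

```python
def should_stop(scores, patience=3, min_gain=0.005):
    """Return True when the last `patience` attempts failed to beat the
    earlier best score by at least `min_gain`.

    scores: validation scores in the order the attempts were made
            (assumption: higher is better).
    """
    if len(scores) <= patience:
        # Not enough history yet to call the recent attempts failures.
        return False
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return best_recent < best_before + min_gain


# Hypothetical history: big early gains, then three attempts that barely move.
history = [0.71, 0.78, 0.81, 0.812, 0.809, 0.811]
print(should_stop(history))
```

The thresholds are a judgment call. The point is to decide on them before you start tuning, so that the decision to stop does not depend on your mood that day.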
References
All citations in the article are from the original essay: