A new perspective on data science
A Beginner’s Guide to Teaching Yourself Data Science and Choosing the Right Learning Platform
I am an engineer in the manufacturing industry (and still am, so far). Things became boring, so I decided to make some changes and learn something new. At the beginning of this year, I did what everyone changing careers seems to do nowadays: I started learning to code, with a focus on data science and machine learning.
I am still learning, but I have picked up many things that I could not have imagined knowing six months ago. At first I just learned on various platforms; since May I have been uploading my projects to GitHub to track my progress.
I find self-learning fun and challenging, and thanks to everyone who shares their experience, I have gained immense knowledge. Among the advice shared by self-taught data scientists, how to plan and pick resources at the very beginning of the journey is rarely covered, so I want to share mine.
If you find this article too long, skip to the middle section, where I cover in detail how to choose learning platforms. I think this is less discussed elsewhere and is the part most worth reading.
How it started
It started with one of Coursera’s most popular courses, Machine Learning by Andrew Ng. I attempted it a year ago but stopped, because at that time I found it hard to digest and my motivation was not strong enough. To push myself further, I paid for the course, so that I had to complete it (including assignments) within 6 months to get the certificate. Time passed very fast: before I knew it, there was one month left and I had not restarted the course. Not wanting to waste my money, I started studying hard. To my surprise, once I focused on the idea of “having to complete it”, it was not that hard. The course is not easy, with content comparable to a university course, but I managed to work through it. I learnt just enough MATLAB to complete the assignments, and I made a schedule for the course, spending 3+ hours on it daily.
So a few days before my payment expired, I made it: I completed the course in one month. It was a big relief, and I felt a great sense of fulfilment that I had not felt at work before. My confidence grew too. I felt I could do anything :), so I decided to continue learning.
Research aka Googling
Completing one machine learning course hardly counted as experience; I still knew next to nothing. I had many questions, so I started googling. Below are the things I searched for. You may follow my path, but feel free to customise your own Q&A.
- The differences among buzzwords like Data Science, Machine Learning, Deep Learning and AI.
Here is a wonderful article by Monica Rogati, The AI Hierarchy of Needs, which shares some industry truths.
- What are the skill sets needed for data/ML/AI related jobs?
You may google this question directly, but I personally search for these job titles on LinkedIn (or any job site) and write down the required skills in an organised checklist that I can tick off in the future.
My skill list - Basic (in order of importance):
- Python basics
- Python for data science (pandas: data processing; matplotlib/seaborn: data visualization; scikit-learn: machine learning)
- Statistics, A/B testing, Hypothesis testing
- Git, shell
- Natural language processing
- Image processing/Computer vision
- Time series
- TensorFlow/PyTorch (deep learning)
- Tableau (business intelligence dashboard)
- Databases (NoSQL: MongoDB)
- Spark (big data)
- Hadoop (big data)
- Scala (big data)
Tip: Many platforms now offer Python for data science programmes that cover all the skills in my basic list. Feel free to try a few, take the one you like, and then stick to it.
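The checklist idea above can be sketched in a few lines of Python. This is only an illustration, assuming you have pasted a few job descriptions into strings; the sample postings and the keyword list are made up for the example, not real data.

```python
from collections import Counter

# Hypothetical keyword list and postings, invented for illustration only.
SKILLS = ["python", "sql", "pandas", "statistics", "git", "spark", "tableau"]

postings = [
    "Data analyst: SQL, Python, Tableau and basic statistics required.",
    "Data scientist: Python, pandas, statistics; Spark is a plus.",
    "ML engineer: strong Python, Git, SQL.",
]


def count_skills(texts, skills):
    """Count in how many postings each skill keyword appears."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for skill in skills:
            # Plain substring match: simple, but "git" would also match "digital".
            if skill in lowered:
                counts[skill] += 1
    return counts


skill_counts = count_skills(postings, SKILLS)

# Print the checklist, most requested skills first.
for skill, n in skill_counts.most_common():
    print(f"[ ] {skill} (mentioned in {n} of {len(postings)} postings)")
```

The skills that come out on top are probably the ones to learn first. With real postings you would likely want a longer keyword list and stricter matching than a plain substring check.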
- Browsing through the job titles in my last search, I also wondered how data scientists, data/business analysts, data engineers and machine learning engineers differ.
- To put it simply, data analysts focus more on the business side. I would compare them to product managers, but with a data background.
- Data engineers work with databases and big data. Corporations often have enormous amounts of data, so data handling is a full-time job. When the scale of data and the speed of processing are key, this is not as easy as it sounds.
- Machine learning engineers are more on the programming side; they work on algorithms and architecture.
- Data scientist is a broad title. It may be a combination of data engineer and analyst, and the duties depend on the specific role. However, in a bigger department where the data engineer, analyst and data scientist roles are clearly divided, data scientists are less involved with the business than analysts. They build machine learning models and pipelines, working closely with analysts and data engineers.
- Last but not least, how do you become a data scientist/analyst/engineer? Where do you learn the skills?
At first, just read as many Quora answers, YouTube tips and blogs as possible, until you can draft an initial plan for yourself.
The plan is open to change: whenever you see a better approach, add it to your plan, but remove things as well so that it does not overgrow.
While you are building your plan, start following it at the same time, so that you can tell whether an approach works for you.
The more you try and select, the better your plan will suit you.
Here are the blogs and answers I liked the most:
1. What are the best resources online to learn about data science? — Quora
2. Chen: How can I become a data scientist? — Quora
3. Becoming a Self-Taught Data Scientist — Towards Data Science
4. Learn Python the Right Way in 5 Steps — Dataquest
Highlight of this article ↓ Recommended reading.
How to choose learning platforms
After working out which skills I needed to learn, and with a list of recommended learning platforms from my research in mind, I tried these websites, checked their fees and made my choices.
You should know your needs before choosing a platform. Ask yourself these questions:
- Do you want to pick up a skill quickly?
- Do you want to learn the concepts and basics very well?
- Do you want to get a certificate upon completing the course?
- Do you feel like completing projects to get a certificate?
- How do you like to learn: by reading, watching videos or interactive coding?
- What’s your budget for online learning? Or do you intend to spend no money at all?
- Do you need career services?
Codecademy
I used Codecademy to learn Python and data science basics. It is a text-based interactive website where you code step by step and check your answers.
Pros: The content is well designed. The Data Science path covers all the skills in my basic skills checklist, including SQL, Git, shell, statistics, data wrangling, visualization and machine learning with scikit-learn.
The course provides background and schematics as explanation, with websites and business scenarios as case studies. Considering the fee is only USD$20/month, it is well worth it.
Cons: It is designed for beginners, and step-by-step learning with repetitive practice is not the most time-efficient. So do skip ahead when the course feels slow.
The website is text based, and there are no video tutorials except for certain practice questions. This is no problem for me, but some people do prefer videos. Codecademy does not have a certificate that links to LinkedIn, although since July 2020 it has offered a certificate for download only.
Overall, it is a very affordable platform with good content for beginners.
Kaggle
After using Codecademy for a while, I attended a one-day crash course offered to my university’s alumni and was introduced to the Kaggle website.
Known as one of the most popular machine learning competition sites, Kaggle also offers free courses on most of the skills needed for data science, machine learning and even deep learning.
Pros: First, kudos for the free content.
Second, the courses are designed like crash courses: brief, efficient and to the point. Each lesson comes with an exercise whose answers you can check. It also gives you a taste of Jupyter notebooks if you are new to them.
They even give you a certificate for completing a course, as encouragement.
Cons: As with any crash course, you need to practice a lot elsewhere to retain what you learned. The next step after Kaggle Learn is to join beginner-level Kaggle competitions, google when you have questions, and read other contestants’ work.
However, competitions are more difficult than normal course exercises, and it is difficult to get a high score. I felt overwhelmed and discouraged when checking others’ answers, as Kaggle is full of experts :) Shame on me. So after completing the Kaggle courses, I went back to Codecademy to practice the basics. I will try Kaggle again when I am more confident in my skills.
Still, Kaggle is one of the best websites to learn machine learning!
Coursera
Pros: Coursera courses are basically like the courses you took at university. Concepts are well taught, and there are many courses to choose from. The courses are of high quality because they are offered by universities worldwide.
You can audit a course for free to brush up on concepts you are weak in. Or you can pay $79 once to access assignments and get the course certificate. Coursera certificates are well recognised and look good on LinkedIn.
Coursera now offers Specializations, with a subscription fee of USD$49/month. Completing one is a strong qualification. Overall, the Coursera fee is worth it for the recognition of the skill acquired, as the courses are not easy to complete.
Cons: As with university courses, if you have zero experience in a field, completing a Coursera course may not get you anywhere. You need to practice elsewhere and build some projects as well.
Coursera courses are focused on knowledge, not on industry.
For example, in the famous Stanford Machine Learning course, the assignments are done in MATLAB, while in industry Python is the most used.
So know your needs clearly. There are other machine learning courses taught in Python, for example by the University of Michigan. There are also courses taught in R, a language popular for data science in academia.
For beginners, I would recommend bootcamp-like programmes instead of Coursera, to pick up skills faster. Coursera courses are always good as a reference for concepts.
Udacity
Pros: Udacity is a well-known online bootcamp platform. It offers a good selection of courses, many of which are free.
Courses are mostly delivered as videos and focus heavily on applications, and thus less on concepts. Free courses are short and good for picking up new skills.
They offer Nanodegree programmes too, which cost USD$300–400 per month, which is quite expensive. Paid programmes offer 1 coaching session per month, reviews of your GitHub and LinkedIn profiles, and unlimited advice on your resume and cover letter.
The projects Udacity incorporates into the (paid) courses are the most praised part. They are very practical and based on industry cases, so completing them adds to your experience too. The projects are not too easy, but you can post your questions in the mentor help section and they will get answered. After you submit a project, it is reviewed personally and comes back with plenty of advice.
You will receive a certificate upon completing all the projects at the end of each lesson. If you can keep to a schedule and want to join a bootcamp, Udacity is a good choice.
Cons: Paid programmes are quite expensive, and the estimated completion time is one to two months even if you study continuously in your spare time.
Courses are practical, thus the concepts are not explained in depth.
Tip: Make use of discounts and trials. Try clicking cancel subscription in your first month; the service staff will then offer you a deep discount to continue the subscription.
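To compare the platforms on budget, a rough cost estimate helps. The sketch below uses the fees quoted in this article; the number of months per platform is an assumption you should replace with your own plan, and the Udacity figure is just the midpoint of the quoted range.

```python
# Fees as quoted in this article (USD). Prices change, so check before relying on them.
MONTHLY_FEES = {
    "Codecademy": 20,
    "Coursera Specialization": 49,
    "Udacity": 350,  # midpoint of the quoted 300-400 range
}
COURSERA_SINGLE_COURSE = 79  # one-time payment for a single course certificate


def subscription_cost(platform, months):
    """Total cost of a monthly subscription kept for `months` months."""
    return MONTHLY_FEES[platform] * months


# Hypothetical study plan: the months per platform are assumptions, adjust to your own.
plan = {"Codecademy": 2, "Coursera Specialization": 3, "Udacity": 2}

costs = {name: subscription_cost(name, months) for name, months in plan.items()}
total = sum(costs.values()) + COURSERA_SINGLE_COURSE

for name, cost in costs.items():
    print(f"{name}: ${cost}")
print(f"Single Coursera course: ${COURSERA_SINGLE_COURSE}")
print(f"Total: ${total}")
```

Changing the months in `plan` shows quickly how much the bootcamp-style programme dominates the budget, which is one reason to finish the cheaper basics first.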
Other fish in the sea
There are many platforms offering data science courses, as well as knowledge-sharing and video-streaming websites that teach for free. As long as you want to learn, there are more resources available on the web than you need.
If you have difficulty choosing one, just pick one and stick to it. If it does not work for you, try another one.
Alumni training: Universities and educational institutions often offer alumni crash courses for continued growth at discounted prices.
Although online learning is good enough, I do believe in the power of in-person training. If the fee is reasonable, just go. These courses pack in concentrated information and many resources that a beginner can refer to for a long time.
I was first introduced to Kaggle Learn and a good handbook by such a course.
MOOCs: Udemy, edX, Khan Academy. I have not tried them, but you are free to.
University courses that are not on MOOC platforms: for example, Harvard’s well-known CS109 Data Science class.
DataCamp: interactive learning by coding.
GitHub: Much content and many course notes are shared in GitHub repositories in Jupyter notebook format, which contains explanatory markdown and code. Create a GitHub account and star those you want to check later.
eBooks: Many serious books are shared online. Check the recommendations of the data science influencers you follow most.
Someone recommended this book and I like it: Python Data Science Handbook
YouTube: search for keywords and save the videos to a list. You know the drill.
Documentation: Python, scikit-learn, PyTorch, TensorFlow, Google Colab, Jupyter Notebook, pandas. They all have tutorials for beginners.
Blogs: Follow data science on Medium; the amount of information will overwhelm you. Experts and enthusiasts also post on their personal blogs.
The list goes on and on, and I don’t think I will ever exhaust it. The webpages are unlimited, but your time is not.
0. Make a plan. Planning works wonders for me, and psychologically, keeping a streak feels good. Giving yourself a target helps you achieve it and makes your progress easy to track. Update your plan to fit your schedule and needs.
1. Learn every day, code every day. Coding trains both your mind and your muscle memory; you retain knowledge better once you have done something yourself, multiple times.
2. Making mistakes is okay; you are making progress. Many blogs and videos have emphasized this. It is normal to feel negative when you are constantly stuck and making mistakes as a beginner. You are not alone. Just fix the mistake and know better next time.
3. Participate in Kaggle competitions to hone your skills and learn as a beginner.
4. Build projects to maintain your interest and showcase your skills to potential employers.
5. When in doubt, google. Python and data science have a large community of learners who are willing to answer questions and share their experiences.
6. Contribute to the community. As you gain more experience, share your ideas and answer other people’s questions. As a beginner, I have been taking more than I give, so now I am writing this article to share my experience. Sharing will give you a sense of fulfilment too.
7. Stay positive. It is said that if you fake a smile, you will end up smiling; a positive action CAN result in a positive feeling.
Once in a while, I feel that I am not making enough progress or I am stuck on an issue. I have to shut down the negative thoughts and remind myself how much I have achieved and how well I am doing. I try to focus on the problem and complete what I have planned for the day. If I am still stuck, I sleep on it, and the next day I can see the problem with a clearer mind and a better mood.
My Progress (If you are interested):
My thought process and the explanations above already reveal how I learned, so here I just briefly summarize my learning journey.
- Completed Machine Learning by Stanford on Coursera in one month.
Note: This course is good for concepts, but the assignments are done in MATLAB. For those who want to learn in Python from the start, try this course.
- Took my university’s 1-day crash course: Introduction to Data Science.
Note: I was introduced to Kaggle Learn, the Harvard class and this handbook.
- Kaggle Learn: Python, machine learning and pandas in one month.
Kaggle’s courses are the most efficient to learn from. I completed these and continued to use Codecademy to build my foundations.
- Codecademy’s Data Science learning path in two months.
Note: It covers everything: Python, pandas, SQL, statistics, Git, shell, scikit-learn and so on. The pace is slow sometimes; just skip ahead.
When I have the time to learn R, I will start from this website too.
- Udacity’s Data Scientist Nanodegree programme, in progress (2/4 completed in one month).
Note: As it is the most expensive course, I only took it after acquiring all the basic skills on Codecademy.
The projects are practical, and an appropriate amount of guidance is given. They also give feedback on my GitHub and LinkedIn profiles and my resume.
Tip: try the cancel subscription trick.
Since most materials online are free, you can surely learn a lot without paying a penny. However, if you can spare some money and want to progress more quickly with guidance, do consider these fees an investment in your career. In the end, know your destination and head towards it.
You are welcome to connect with me on LinkedIn.
Good luck to everyone! Keep learning, stay positive!