Data Science Learning Curve: What Are The Low Hanging Fruits?

Many young graduates, mentees of mine and business people often ask me about the best way to get their hands dirty with data science.

My answer varies, but a few things come up in my advice again and again, and I will try to explain them in this article.

Most of the time I also find the discussion drifting toward machine learning, statistics, data analysis and related fields such as artificial intelligence.

If it were up to me, I would give a single name to all of these technologies, considering how closely they sit side by side.

Having these discussions is important, but if not handled well, the information becomes overwhelming and confusing.

In this article I limit myself mostly to data science and machine learning.

I will take a broad look at the concepts here, and examine and explain them in more detail in the notes that follow.

So what is data science, and why is it important? In the simplest sense, data science is the study of data in order to gain insights from it and to use this knowledge for many purposes, including recognizing patterns, making decisions, and identifying trends, business opportunities or threats.

For example, a business opportunity could be identifying products to cross-sell while a user still has items in an online shopping cart, and a business threat could be identifying potential credit defaulters or customers about to switch service providers.

Data science uses a combination of technologies such as statistics, machine learning, programming, and visualization to achieve its goals.

One way to formalize data science projects is to follow what is known as the data science lifecycle, or sometimes the data analytics lifecycle.

This process has many phases, which in themselves can be independent data science disciplines.

As mentioned earlier, in future articles I will endeavor to examine the various stages of this life cycle in more detail.

Data science is important because, with the proliferation of data-generating agents and the vastly different formats in which that data arrives, the necessary and appropriate sciences must evolve so that such data can be consumed in a useful and efficient way. Much of this data could not have been analyzed a decade ago.

New techniques inherent in the data science ecosystem can aid in the ingestion and analysis of such data formats.

For example, most website data is not structured in traditional ways, e.g. in rows and columns, but in HTML with varying structures, i.e. without a predefined schema.

Tools available in the Python programming language can be used to search it for insights such as sentiment on a stock market.
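As a rough illustration of the point above, the sketch below pulls the text out of a small HTML page with Python's standard-library html.parser module and counts positive and negative words. The word lists, the sample page and the names TextExtractor and sentiment_score are all invented for this example; a real project would use a proper sentiment lexicon or model.

```python
from html.parser import HTMLParser

# Hypothetical word lists, made up for this sketch only.
POSITIVE = {"gain", "gains", "rally", "surge", "up"}
NEGATIVE = {"loss", "losses", "slump", "fall", "down"}


class TextExtractor(HTMLParser):
    """Collects the text found between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text between tags.
        if data.strip():
            self.chunks.append(data.strip())


def sentiment_score(html):
    """Return (number of positive words) minus (number of negative words)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.chunks).lower().split()
    words = [w.strip(".,!?;:") for w in words]  # drop trailing punctuation
    positive = sum(w in POSITIVE for w in words)
    negative = sum(w in NEGATIVE for w in words)
    return positive - negative


# Made-up page: the text sits in different tags, with no fixed schema.
sample_page = """
<html><body>
  <h1>Market update</h1>
  <p>Tech shares rally as earnings surge.</p>
  <div><span>Mining stocks slump.</span></div>
</body></html>
"""

print(sentiment_score(sample_page))
```

A score above zero suggests more positive than negative wording on the page. Counting words is a crude measure, but it shows how unstructured HTML can be turned into a number that can be analyzed.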

What is machine learning and why is it important? In its simplest form, machine learning is a subset of data science and a discipline that trains machines on one set of data so that, when fed another set of data they have not seen, they are able to make informed decisions or carry out tasks based on the training they have received.

In other words, these are systems that learn from experience. Problems that can be solved by machine learning include predictions, classifications, recommendations, or pattern recognition, among others.
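The idea of training on one data set and deciding on another can be sketched in a few lines of plain Python. The example below uses a nearest-centroid classifier, chosen here only because it is short; the customer data and the function names train and predict are invented for illustration.

```python
def train(samples):
    """Learn one centroid (mean point) per label from (features, label) pairs."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


# Invented data: (support calls per month, months as a customer) -> outcome.
training_set = [
    ((9, 2), "churn"), ((8, 3), "churn"), ((7, 1), "churn"),
    ((1, 24), "stay"), ((2, 30), "stay"), ((0, 18), "stay"),
]
# Data the model has not seen during training.
unseen_set = [((8, 2), "churn"), ((1, 20), "stay")]

model = train(training_set)
predictions = [predict(model, features) for features, _ in unseen_set]
print(predictions)
```

The model never sees the second list while it is being trained, which is what makes its predictions on that list a fair check of what it has learned.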

In future articles, I will discuss supervised, unsupervised, and reinforcement machine learning, as well as individual problems to be solved by machine learning, and the many machine learning algorithms available.

The motivations for pursuing data science also vary, but over the years I have grouped them into roughly four main categories: career development, smart start-up solutions, research, and business opportunities and threats.

Various job websites suggest that a career in data science has great prospects for a higher salary than most other data-driven jobs in many countries around the world.

The demand for data scientists' skills seems to exceed the available supply.

At many technology conferences the buzzword is “start-up”, which has become synonymous with computer-aided intelligent solutions, be they apps or solutions embedded in electronic devices.

Another common motivation is scientific research. Most students who work on such projects for their degrees find themselves in this field after graduation.

Most companies look for smart solutions to identify opportunities and risks. Typically, they hire consultants to develop models capable of identifying new opportunities in the market, as well as identifying possible threats and advising on ways to prevent them.

How can a budding data scientist accelerate their ambitions or shorten the learning curve? On this learning curve, I will identify the low hanging fruits that can serve as a prerequisite for the more "exotic" or advanced aspects of the field.

Below is a list of things that I think are low hanging fruits on the data science learning curve.

Understanding statistics is important for several roles across the whole data science life cycle or ecosystem.

For example, an important early preparatory phase in the ecosystem is visualizing the data and producing some basic statistics, such as measures of central tendency, dispersion, and correlation.
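The basic statistics mentioned above can be produced with Python's standard-library statistics module. The sketch below uses invented monthly figures; the pearson helper is written out by hand so that the example does not depend on a particular Python version.

```python
import statistics

# Invented sample: advertising spend and sales over six months.
ad_spend = [10, 12, 15, 17, 20, 22]
sales = [41, 44, 49, 55, 60, 63]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


summary = {
    "mean": statistics.mean(sales),       # central tendency
    "median": statistics.median(sales),   # central tendency
    "stdev": statistics.stdev(sales),     # dispersion (sample standard deviation)
    "correlation": pearson(ad_spend, sales),
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A correlation close to 1 indicates that the two series rise together, which is the kind of early signal worth checking before any modelling starts.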

Most data science or machine learning experiments require statistical interpretation: for example, reading a confusion matrix that shows the number of false positive or false negative results, or judging whether a result is statistically significant and whether the null hypothesis is rejected.

The accuracy of data science models is also measured using values that need to be statistically interpreted to make sense.
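To make the confusion matrix and accuracy measures concrete, here is a small sketch in plain Python. The labels and predictions are invented, and the helper names confusion_matrix and metrics are chosen only for this example; it assumes at least one positive case is both present and predicted, otherwise the divisions would fail.

```python
def confusion_matrix(actual, predicted, positive="default"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def metrics(c):
    """Derive accuracy, precision and recall from the four counts."""
    total = sum(c.values())
    return {
        "accuracy": (c["tp"] + c["tn"]) / total,
        "precision": c["tp"] / (c["tp"] + c["fp"]),
        "recall": c["tp"] / (c["tp"] + c["fn"]),
    }


# Invented outcomes for eight loan customers and a model's predictions.
actual = ["default", "default", "repay", "repay",
          "repay", "default", "repay", "repay"]
predicted = ["default", "repay", "repay", "default",
             "repay", "default", "repay", "repay"]

matrix = confusion_matrix(actual, predicted)
scores = metrics(matrix)
print(matrix)
print(scores)
```

Accuracy alone can mislead when one class is rare, which is why precision (how many flagged defaulters really defaulted) and recall (how many real defaulters were flagged) are usually read alongside it.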

Expertise in a subject is not a prerequisite per se, but it helps in getting to grips with the requirements of the data science experiment at hand.

For example, when designing a data science project that scrapes the internet for sentiment on stock markets, it would be helpful for the scientist to understand things like opens, closes, lows and highs in stock prices, and other aspects of stock market or fundamental analysis.

I was able to create a poultry dataset that I have since published on Kaggle.com based on my knowledge of poultry.

As in any other science, the need for appropriate apparatus cannot be overemphasized. The cost of the tools of the trade is often cited as a barrier to entry.

This can be mitigated by opting for open source tools such as the Python or R programming languages, which can deliver end-to-end data science projects.

These are open source languages with many frameworks that support machine learning, web development, game development, and data visualization, among others.

Care should be taken with open source software, since it is published under licenses that permit use only under certain conditions.

Microsoft joined the party with Azure Machine Learning Studio, which requires no payment for the basic configuration of its trials, as long as the terms and conditions are followed.

Azure Machine Learning Studio is a browser-based, easy-to-use, intuitive tool for interactively building and deploying machine learning solutions.

With the appropriate research, a suitable set of open source tools can be assembled so that start-ups, or even established companies, can do most of the work that requires software technology, such as end-to-end product development, at almost no cost.

Machine learning skills are a must to qualify as a data scientist.

A data scientist should be able to match a problem to appropriate machine learning techniques or algorithms.

For example, if the task is to predict binary decisions, logistic regression is often an appropriate algorithm.
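As a minimal sketch of logistic regression on a binary decision, the code below fits a one-feature model with batch gradient descent in plain Python. The data, the learning rate, the number of epochs and the function names are assumptions made for this example; in practice a library such as scikit-learn would normally be used.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit a weight and a bias for one feature by gradient descent on log loss."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(weight * x + bias) - y  # predicted minus actual
            grad_w += error * x
            grad_b += error
        weight -= learning_rate * grad_w / n
        bias -= learning_rate * grad_b / n
    return weight, bias


def predict_proba(weight, bias, x):
    """Estimated probability that the outcome is 1 for input x."""
    return sigmoid(weight * x + bias)


# Invented data: hours studied -> passed the exam (1) or not (0).
hours = [0.5, 1.0, 1.5, 2.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

weight, bias = fit_logistic(hours, passed)
print(f"P(pass | 1.0 hours) = {predict_proba(weight, bias, 1.0):.2f}")
print(f"P(pass | 4.5 hours) = {predict_proba(weight, bias, 4.5):.2f}")
```

The model outputs a probability, and the binary decision follows from comparing it with a threshold, commonly 0.5.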

Data visualization is important in the data science lifecycle because the results of most data science experiments need to be presented in such a way that they convey a meaningful message to the intended audience.

Some common commercial visualization tools are QlikView, Power BI, MicroStrategy, Pentaho, and Tableau. In addition to visualization, most of these tools also offer other functions such as ETL (Extract, Transform and Load).

Given that cost has been identified as a barrier to entry in many other cases, most of these vendors offer trial versions of their software for a limited period, after which they expect the user to purchase a license.

Other providers offer registered students an open-ended trial version with limited functionality, as long as the general terms and conditions are followed.

For example, a solution resulting from a trial cannot be sold or shared. Some open source tools also have visualization functions. For example, Python has several frameworks that support visualization.

These five steps are in no way prescriptive, but mastering them will surely provide a shortcut on the long road into the realm of data science and its larger ecosystem.

Another question that needs to be asked is how can these skills be acquired in a way that is simple, inexpensive, and timely?

The following list includes some of the resources that I have found helpful on my own data science journey.

I think the internet is the best "inexpensive" university of our time. The quality of the material varies.

There is plenty of material online, which can be overwhelming for someone looking for the right resources, but over time the noise can be filtered out.

Several renowned universities offer free courses through Massive Open Online Courses (MOOCs).

These courses are mainly offered online by the same tutors who teach the full-time students.

I have found some of the offerings from Harvard and MIT particularly useful.

Udemy.com offers a variety of courses at reasonable prices.

Most of the providers of these courses are also senior faculty members at leading universities or people who have started successful businesses.

Kaggle.com is a very rich source of material, especially for those inclined to use Python as a tool of the trade.

You can also find good material on YouTube if you refine your search well enough. I learned web scraping with Python from a YouTube video presented by a YouTuber from India.

The video I used was presented in a mixture of Hindi and English. Although I didn't understand a single word of Hindi, I was able to follow the lecture.

Here in South Africa I find that the “Enterprise Workplace Skills Plan (WSP)” at the University of Pretoria offers particularly good part-time courses in data science.

One such course that I took is a six-day course called "Applied Machine Learning".

Having argued that cost is an obstacle, I would also like to point out that some investment must be made if the desire to become a data scientist is to be realized.

Since this is an investment in future income, some costs have to be incurred. Many resources can be obtained from a variety of sources.

I have bought very useful Kindle books on Amazon for $2 or less, and they have served me well.

As Jim Rohn once said, “Formal education will make you a living; self-education will make you a fortune.”

In conclusion, I hope I have shed some light on the ways in which one can become a data scientist.

I believe these steps are the low hanging fruit that would be easy to pick on a journey to becoming a data scientist.

In future notes on the subject, I will elaborate on selected aspects of the subject.


About the author

Phuzo Soko is Senior Business Intelligence Manager / Data Scientist at an insurance company in Johannesburg, South Africa. He is interested in business intelligence, machine learning, data science and artificial intelligence. He has a BTech in Software Development from Tshwane University of Technology, a Certificate in Cyber Security from the University of Johannesburg, and a Certificate in Finance and Investment from the University of the Witwatersrand.



