This lesson is still being designed and assembled (Pre-Alpha version)

An Introduction to Scikit-Learn and Pandas

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is Scikit-Learn?

  • What is Pandas?

Objectives
  • First learning objective. (FIXME)

About Scikit-Learn

Scikit-Learn is an open-source toolkit for machine learning in Python. It is built upon well-established Python toolboxes for numerical and scientific computation: NumPy, SciPy, and Matplotlib.

Scikit-Learn contains many supervised machine learning methods (for both classification and regression) as well as unsupervised methods (such as clustering). It also provides algorithms to reduce the dimensionality of a problem (e.g., principal component analysis [PCA]) and many convenience tools to prepare the data and measure the accuracy of a trained ML model.
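The categories of tools listed above can be sketched in one short script. This is a minimal illustration, not a recommended workflow: the choice of the built-in Iris dataset, logistic regression as the classifier, k-means as the clustering method, and all parameter values are assumptions made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load a small built-in dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Data preparation: hold out a test set, then standardize the features
# using statistics computed from the training set only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Dimensionality reduction: project the 4 features onto 2 principal components.
pca = PCA(n_components=2).fit(X_train_std)
X_train_2d = pca.transform(X_train_std)
X_test_2d = pca.transform(X_test_std)

# Supervised learning (classification): train on the reduced features.
clf = LogisticRegression(max_iter=1000).fit(X_train_2d, y_train)

# Accuracy measurement: compare predictions against the held-out labels.
test_accuracy = accuracy_score(y_test, clf.predict(X_test_2d))
print(f"Test accuracy: {test_accuracy:.3f}")

# Unsupervised learning (clustering): group the same points without labels.
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_train_2d)
print("Number of clusters found:", len(set(cluster_ids)))
```

Note that most Scikit-Learn objects follow the same pattern: create the object, call `fit` on training data, then call `transform` or `predict` on new data.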

Figure: Scikit-Learn front web page

The project website is https://scikit-learn.org/. It is full of helpful tutorials, documentation, and technical instructions. For every method, there is a technical explanation that includes pointers on the advantages and disadvantages of the method, as well as detailed programming documentation (the API [Application Programming Interface] reference). Each documentation page is accompanied by plenty of examples.

As an example, consider the support vector machine (SVM) method, which the website covers with both an explanatory guide and an API reference.
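To give a feel for what the SVM documentation describes, here is a minimal sketch of training an SVM classifier with `sklearn.svm.SVC`. The dataset (Iris), the RBF kernel, and the parameter values are assumptions chosen for illustration; the website explains how to choose them for a real problem.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load the data and hold out 25% of the samples for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# SVMs are sensitive to feature scales, so chain a scaler and the classifier.
# The pipeline fits the scaler on the training data only.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)

# score() returns the mean accuracy on the given test data.
svm_accuracy = model.score(X_test, y_test)
print(f"SVM test accuracy: {svm_accuracy:.3f}")
```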

While learning about ML techniques in this workshop, it is highly recommended that you also consult Scikit-Learn’s website for further information. It is an exceptional resource for learning and reference purposes.

About Pandas

Pandas is a data analysis library for Python. While Scikit-Learn provides machine-learning capabilities to Python, Pandas provides powerful and convenient data-handling capabilities that complement Scikit-Learn, NumPy, etc. Pandas uses NumPy arrays under the hood, but it adds data analytics capabilities that are not part of NumPy. With Pandas, one can filter and transform data, join two or more DataFrames, and so on; these steps are often necessary when preparing data for machine learning.
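The operations mentioned above (filtering, transformation, and joining) might look like the following sketch. The two small tables, their column names, and their values are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical measurement table (made up for illustration).
measurements = pd.DataFrame({
    "sample_id": [1, 2, 3, 4],
    "length_cm": [5.1, 4.9, 6.3, 5.8],
    "width_cm": [3.5, 3.0, 3.3, 2.7],
})

# Hypothetical label table; note that sample 4 has no label.
labels = pd.DataFrame({
    "sample_id": [1, 2, 3, 5],
    "species": ["setosa", "setosa", "virginica", "versicolor"],
})

# Filtering: keep only the rows whose length exceeds 5 cm.
long_samples = measurements[measurements["length_cm"] > 5.0]

# Transformation: derive a new column from existing ones.
measurements["area_cm2"] = measurements["length_cm"] * measurements["width_cm"]

# Joining: combine the two DataFrames on their shared key.
# An inner join keeps only sample_ids present in both tables.
merged = measurements.merge(labels, on="sample_id", how="inner")
print(merged)

# The numeric data can be handed over as a plain NumPy array.
values = merged[["length_cm", "width_cm"]].to_numpy()
print(type(values), values.shape)
```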

In this lesson, we choose to use Pandas because it works well in conjunction with Scikit-Learn. (Pandas also works well with popular deep learning frameworks such as Keras.) At the end of this lesson, we will briefly present an alternative way to perform an ML workflow on Spark, using a library called “MLlib”.
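As a small illustration of how the two libraries fit together, the sketch below passes Pandas objects directly to a Scikit-Learn estimator. The toy table and the choice of a decision tree classifier are assumptions made for this example, not part of the lesson's dataset.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy table (values made up for illustration).
df = pd.DataFrame({
    "length_cm": [5.1, 4.9, 6.3, 5.8, 5.0, 6.5],
    "width_cm": [3.5, 3.0, 3.3, 2.7, 3.4, 3.0],
    "species": ["setosa", "setosa", "virginica", "virginica", "setosa", "virginica"],
})

# Select feature columns (a DataFrame) and the target column (a Series).
X = df[["length_cm", "width_cm"]]
y = df["species"]

# Scikit-Learn estimators accept DataFrames and Series directly.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# New samples should use the same column names as the training data.
new_samples = pd.DataFrame({"length_cm": [5.0, 6.4], "width_cm": [3.4, 3.1]})
predictions = clf.predict(new_samples)
print(list(predictions))
```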

Key Points

  • Scikit-Learn provides machine learning capabilities for Python.

  • Pandas provides data handling and analytic tools for Python.