An Introduction to Scikit-Learn and Pandas
Overview
Teaching: 0 min
Exercises: 0 min
Questions
What is Scikit-Learn?
What is Pandas?
About Scikit-Learn
Scikit-Learn is an open-source toolkit for machine learning in Python. It is built upon well-established Python toolboxes for numerical and scientific computing: NumPy, SciPy, and Matplotlib.
Scikit-Learn contains many supervised machine learning methods (both classification and regression), as well as unsupervised methods (such as clustering). It also includes algorithms to reduce the dimensionality of a problem (e.g. principal component analysis [PCA]), and many convenience tools to prepare the data and to measure the accuracy of a trained ML model.
The project website is https://scikit-learn.org/. It is full of helpful tutorials, documentation, and technical instructions. For every method, there is a technical explanation, including pointers on the advantages and disadvantages of the method, as well as detailed programming documentation (the API [Application Programming Interface] reference). Each documentation page is accompanied by plenty of examples.
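To make these categories concrete, here is a minimal sketch that touches each of them once. The choice of the bundled iris dataset, logistic regression as the classifier, k-means as the clustering method, and all parameter values are assumptions made for illustration only; they are not prescribed by this lesson.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Small example dataset bundled with Scikit-Learn (an illustrative choice).
X, y = load_iris(return_X_y=True)

# Data preparation: hold out a test set, then standardize the features
# using statistics computed on the training set only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Supervised learning: train a classifier and measure its accuracy.
clf = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)
acc = accuracy_score(y_test, clf.predict(X_test_s))
print(f"Classification accuracy on the test set: {acc:.3f}")

# Dimensionality reduction: project the four features onto two components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_train_s)
print("Shape after PCA:", X_2d.shape)

# Unsupervised learning: group the samples into clusters without labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_train_s)
print("Cluster assigned to the first five samples:", cluster_labels[:5])
```

Note that every estimator follows the same pattern: create the object, call `fit` on the training data, then call `predict` or `transform`. This uniform interface is what makes it easy to swap one method for another.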
As an example, consider the support vector machines (SVM) method:
- This website provides the technical overview of SVM;
- The generated API documentation provides the exact details of the SVC class.
- Further down the API document, one will find many worked examples on the use of the SVM method.
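As a taste of what the `SVC` class looks like in use, here is a minimal sketch. The iris dataset, the split ratio, and the hyperparameter values (`kernel`, `C`, `gamma`) are assumptions chosen for illustration; consult the API documentation for the full list of parameters and their meaning.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Example data bundled with Scikit-Learn (an illustrative choice).
X, y = load_iris(return_X_y=True)

# Keep 30% of the samples aside for testing; stratify preserves class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Create the classifier; these hyperparameter values are illustrative only.
svc_model = SVC(kernel="rbf", C=1.0, gamma="scale")

# Train on the training set, then evaluate on the held-out test set.
svc_model.fit(X_train, y_train)
test_predictions = svc_model.predict(X_test)
test_accuracy = svc_model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```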
While learning about ML techniques in this workshop, it is highly recommended that you also consult Scikit-Learn’s website for further information. It is an exceptional resource for learning and reference purposes.
About Pandas
Pandas is a data analysis library for Python. While Scikit-Learn provides machine-learning capabilities to Python, Pandas provides powerful and convenient data-handling capabilities that complement Scikit-Learn, NumPy, etc. Pandas uses NumPy arrays under the hood, but it adds data analytics capabilities that are not part of NumPy. With Pandas, one can filter and transform data, join two or more DataFrames, and so on; these steps are often necessary when preparing data for machine learning.
In this lesson, we choose to use Pandas because it works well in conjunction with Scikit-Learn. (Pandas also works well with popular deep learning frameworks such as Keras.) At the end of this lesson, we will briefly present an alternative way to perform an ML workflow on Spark, using a library called “MLlib”.
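The operations named above (filtering, transformation, and joining) can be sketched in a few lines. The two small tables below, including their column names and values, are made up purely for illustration.

```python
import pandas as pd

# Two small, made-up tables (hypothetical data for illustration).
samples = pd.DataFrame(
    {
        "sample_id": [1, 2, 3, 4],
        "length_cm": [5.1, 4.9, 6.3, 5.8],
        "site": ["A", "B", "A", "C"],
    }
)
sites = pd.DataFrame({"site": ["A", "B"], "region": ["north", "south"]})

# Filtering: keep only the rows that satisfy a condition.
long_samples = samples[samples["length_cm"] > 5.0]

# Transformation: derive a new column from an existing one.
long_samples = long_samples.assign(length_mm=long_samples["length_cm"] * 10)

# Joining: attach the region of each site; a left join keeps every sample,
# leaving a missing value where the site has no match.
merged = long_samples.merge(sites, on="site", how="left")
print(merged)

# The numeric columns can be handed over as a plain NumPy array.
values = merged[["length_cm", "length_mm"]].to_numpy()
print(type(values), values.shape)
```

Site "C" does not appear in the `sites` table, so its `region` is left as a missing value after the join. Missing values like this usually need to be handled before the data is passed to a machine learning method.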
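As a minimal sketch of how the two libraries fit together, a Pandas DataFrame can be passed directly to a Scikit-Learn estimator. The tiny dataset, its column names, and the choice of a decision tree classifier are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# A tiny, made-up labeled dataset (hypothetical values for illustration).
df = pd.DataFrame(
    {
        "height": [1.0, 1.2, 3.1, 3.3, 0.9, 3.0],
        "width": [0.5, 0.6, 1.9, 2.1, 0.4, 2.0],
        "label": ["small", "small", "large", "large", "small", "large"],
    }
)

# Select the feature columns and the target column with Pandas.
features = df[["height", "width"]]
target = df["label"]

# Scikit-Learn estimators accept DataFrames and Series directly.
tree = DecisionTreeClassifier(random_state=0).fit(features, target)

# New, unlabeled items must use the same feature columns as the training data.
new_items = pd.DataFrame({"height": [1.1, 3.2], "width": [0.5, 2.0]})
predictions = tree.predict(new_items)
print(predictions)
```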
Key Points
Scikit-Learn provides machine learning capabilities for Python.
Pandas provides data handling and analytic tools for Python.