Hello everyone!
As we all know, Data Science is a domain of Computer Science that revolves around heavy usage of data and its features. As a result, there is a real need to get acquainted with the tools that make such data-heavy work possible on your machine, so that you can learn and implement your projects efficiently.
Here is a list of must-have tools, along with a stable version for each of them, that we have put together for Data Science beginners who want to pursue a career in the field.
Anaconda (Not the big snake)
To begin with, we must install Anaconda: https://www.anaconda.com/distribution/
It is the easiest way to run various algorithms and data processing in Python/R for data science and machine learning on Linux, Windows, and Mac OS X.
In simpler terms, Anaconda creates isolated environments on your machine, so that tools and utilities are installed inside an environment rather than on the system itself. It also comes pre-installed with most of the modules required for machine learning and data science.
With the virtual environment feature, one can run two different versions of the same tool, depending on what each task requires.
Once you have installed Anaconda on your machine, you can run the conda command to check whether it is installed correctly.
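Running either of the following should print the installed version and some basic configuration details if everything is set up:
conda --version
conda info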
As an example of this, suppose a person is working on two different projects: one requiring Python version 3.6 and the other requiring Python version 3.2.
To cater to both requirements, one can create two separate environments using the following command (in Linux):
conda create --name my_env python=3.2
Here, we have specified python=3.2 to create an environment with Python 3.2 in it.
Similarly, we can create a second environment with python=3.6 in it.
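To switch between the two environments, activate whichever one the current project needs (recent conda releases use conda activate, while older ones use source activate instead):
conda activate my_env
conda deactivate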
Now let's go through the list of tools to set up in your Anaconda environment to get started with the implementation work:
1. Python 3.6:
One of the major reasons programmers prefer to code in Python is its simplicity and readability. Python, unlike most other programming languages, emphasizes code readability and lets the developer use English keywords instead of punctuation. Along with this, it has many libraries/modules pre-loaded with functions and algorithms for data analysis and machine learning.
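As a tiny, made-up illustration, keywords such as for, in, and, and not let a filtering rule read almost like an English sentence:
# Keep the scores that are at least 70 and are not flagged as retakes
scores = [72, 88, 95, 60]
retakes = [95]
passing = [s for s in scores if s >= 70 and s not in retakes]
print(passing)  # [72, 88]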
2. Cudatoolkit (version 9.0):
One of the major reasons for using the NVIDIA Cudatoolkit is that it provides a development environment that helps developers create high-performance applications accelerated by the Graphics Processing Unit (GPU).
The Cudatoolkit also includes various GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler, and a runtime library, all of which are a great help to developers when deploying their applications.
The following command can be used to install this tool on your Anaconda environment (for Ubuntu):
conda install -c anaconda cudatoolkit
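You can then confirm that the package shows up in the environment's package list:
conda list cudatoolkit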
3. NumPy (version 1.15.4):
NumPy is a Python library/package that stands for Numerical Python.
It is the core library for scientific computing: it provides a powerful n-dimensional array object as well as various tools for integration with C, C++, etc.
The main advantages NumPy arrays have over plain Python lists are speed, lower memory usage, and ease of use.
The following command can be used to install this tool on your Anaconda environment (for Ubuntu):
conda install -c anaconda numpy
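As a minimal sketch (the array values here are arbitrary), NumPy arrays support vectorized, element-wise operations without explicit loops:
import numpy as np

# Create a 2-D array and inspect its shape
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
print(a.shape)         # (2, 3)

# Operations apply element-wise, no explicit Python loops needed
print(a * 2)
print(a.mean(axis=0))  # column-wise means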
4. Pandas (version 0.24.1):
In simple terms, Pandas is used to clean data so that it can be used efficiently for carrying out analysis.
Some of the major functionalities offered by Pandas that make it a good choice for data analysts are data cleaning, data manipulation, and data analysis.
Pandas also provides highly optimized performance, since its performance-critical code paths are written in C or Cython under the hood.
The following command can be used to install this tool on your Anaconda environment (for Ubuntu):
conda install -c anaconda pandas
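As a rough sketch of what that cleaning looks like in practice (the column names and values below are invented for illustration):
import pandas as pd

# A small DataFrame with missing values and a duplicated row
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Ben", "Cara"],
    "score": [88.0, None, None, 95.0],
})

df = df.drop_duplicates()                             # drop the repeated row
df["score"] = df["score"].fillna(df["score"].mean())  # fill missing scores with the mean
print(df.describe())                                  # quick summary statistics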
5. PyTorch (version 0.4.1):
One of the major reasons that make PyTorch a go-to library for data scientists is the ease with which it lets you build extremely complex neural networks.
PyTorch also lets developers run and test parts of the code in real time. As a result, deep learning scientists, machine learning developers, and neural network debuggers don't have to wait for the entire program to execute to conclude whether the code works or not. This is a great help when writing large programs, which is very often the case while implementing various ML models.
The following command can be used to install this tool on your Anaconda environment (for Ubuntu):
conda install -c pytorch pytorch
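As a minimal sketch (the layer sizes and inputs are arbitrary), a tiny network can be defined and run immediately, and you can also check whether the CUDA toolkit from step 2 is usable:
import torch
import torch.nn as nn

print(torch.cuda.is_available())   # True if a GPU and the CUDA toolkit are usable

# A tiny fully connected network
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
)

x = torch.randn(2, 4)              # a batch of 2 random samples
y = model(x)                       # executes eagerly, no separate compile step
print(y.shape)                     # torch.Size([2, 1])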
6. Pytorch-pretrained-bert (version 0.4.0):
PyTorch has already been discussed above.
Pretrained: In deep learning, as the name suggests, pre-training simply means training a model on a large, general dataset before it is applied to a particular task.
BERT: BERT stands for Bidirectional Encoder Representations from Transformers. It is a neural-network-based technique for natural language processing pre-training. In simple terms, BERT helps systems such as Google Search discern the context of words in search queries more efficiently.
To install this package with the conda command, run any one of the following commands:
conda install -c conda-forge pytorch-pretrained-bert
## OR
conda install -c conda-forge/label/cf201901 pytorch-pretrained-bert
## OR
conda install -c conda-forge/label/cf202003 pytorch-pretrained-bert
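Once installed, a rough sketch of encoding a sentence with a pre-trained BERT model looks like this (the sentence is made up, and the bert-base-uncased weights are downloaded on first use):
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

# Load the pre-trained tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

text = "conda makes environment management easy"
tokens = tokenizer.tokenize(text)              # WordPiece tokens (real pipelines also add [CLS]/[SEP])
token_ids = tokenizer.convert_tokens_to_ids(tokens)
tokens_tensor = torch.tensor([token_ids])

with torch.no_grad():
    encoded_layers, pooled_output = model(tokens_tensor)

print(len(encoded_layers), encoded_layers[-1].shape)  # 12 layers of hidden states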
7. Scikit-learn (version 0.20.1):
One of the major reasons that make Scikit-learn a go-to library for data scientists is its ready-made implementations of a wide range of applied machine learning algorithms.
The Scikit-learn toolkit provides simple and efficient tools for predictive data analysis, and it is built on top of Python libraries such as NumPy, SciPy, and Matplotlib.
To install this package with the conda command, run the following command:
conda install -c anaconda scikit-learn
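As a small sketch of the typical workflow, here is an arbitrary classifier trained and evaluated on scikit-learn's built-in Iris dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a toy dataset and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit a simple classifier and evaluate it on held-out data
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))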
8. SciPy (version 1.1.0):
SciPy is a Python library whose name stands for Scientific Python. It builds on NumPy and adds a broad set of higher-level mathematical functions.
SciPy uses NumPy arrays as its basic data structure and comes with modules for the tasks that appear most often in scientific programming: linear algebra (solving linear equations with matrices, checking linear dependence and independence, etc.), calculus (integration, differentiation, etc.), solving ordinary differential equations, and various signal processing tasks.
To install this package with the conda command, run the following command:
conda install -c anaconda scipy
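As a brief sketch of two of those modules in action, here is a made-up linear system being solved and a simple definite integral being evaluated:
import numpy as np
from scipy import linalg, integrate

# Solve the linear system  2x + 3y = 8,  x - y = -1
A = np.array([[2.0, 3.0],
              [1.0, -1.0]])
b = np.array([8.0, -1.0])
print(linalg.solve(A, b))            # [1. 2.]

# Numerically integrate sin(x) from 0 to pi (the exact answer is 2)
value, error = integrate.quad(np.sin, 0, np.pi)
print(value)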
Conclusion:
One should note that the choice of tools depends entirely on the domain and the type of work under consideration. The tools mentioned above are some of the most widely used and recommended tools for anyone looking to get ahead in the field of Data Science.
We hope that you have found this post helpful. We will be coming up with more such informative posts to help you get ahead.
Thanks for reading! Hope you make the best of this Quarantine Period and utilize it to get ahead in your field of interest : )