In this unsupervised machine learning tutorial, we will discuss categorical data encoding.
Data encoding is the process of converting data into a format that can be stored, transmitted, or processed efficiently.
In general computing, some encodings make data more compact so that it can be stored on devices with limited space or sent over networks with limited bandwidth, and others add redundancy so that corruption during transmission or storage can be detected. In data science, encoding usually means converting raw values into a form, typically numeric, that an algorithm can work with.
There are many different types of data encoding, including ASCII, Unicode, binary, hexadecimal and XML.
A categorical variable is a variable that takes one of a limited, fixed set of values, called categories. For example, the variable ‘Gender’ could be either male or female.
The encoding of categorical variables is the process of assigning one or more numeric values to each category of a categorical variable. For example, the ‘Gender’ variable above has two possible values, so a simple encoding assigns 1 to one and 2 to the other; a variable with five categories would use the numbers 1 to 5.
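The integer assignment described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `label_encode` and the sample column are hypothetical, and the choice to number categories in alphabetical order is an assumption made only so that the mapping is reproducible.

```python
def label_encode(values):
    """Map each distinct category to an integer code starting at 1."""
    # Sort the distinct categories so the mapping is the same on every run
    # (alphabetical order is an arbitrary choice, not a requirement).
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories, start=1)}
    # Replace every value with its integer code.
    return [mapping[value] for value in values], mapping


# Hypothetical sample column
gender = ["male", "female", "female", "male"]
codes, mapping = label_encode(gender)

print(mapping)  # {'female': 1, 'male': 2}
print(codes)    # [2, 1, 1, 2]
```

Returning the mapping alongside the codes makes it possible to decode the numbers later or to apply the same mapping to new data.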
There are two types of encoding: nominal and ordinal. A nominal variable has no inherent order (for example, colour), so its categories are assigned arbitrary numbers from 1 to n (where n is the total number of categories) and the numeric order carries no meaning. An ordinal variable has a natural order (for example, low, medium, high), so its categories are numbered from 1 to k (where k is the total number of categories) following that order.
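For an ordinal variable, the codes are normally chosen to follow the natural order of the categories, so the order has to be supplied explicitly. The sketch below shows one way to do that; the function name `ordinal_encode`, the `sizes` column, and the category order are all hypothetical values chosen for illustration.

```python
def ordinal_encode(values, ordered_categories):
    """Encode values as 1..k, where k is the number of categories,
    using the explicit order given in ordered_categories."""
    # The rank of each category is its 1-based position in the supplied order.
    rank = {category: position
            for position, category in enumerate(ordered_categories, start=1)}
    return [rank[value] for value in values]


# Hypothetical ordinal column with a natural order: small < medium < large
sizes = ["medium", "small", "large", "small"]
size_codes = ordinal_encode(sizes, ["small", "medium", "large"])

print(size_codes)  # [2, 1, 3, 1]
```

Because the codes follow the category order, comparisons such as `size_codes[0] > size_codes[1]` are meaningful here. The same comparison on codes for a nominal variable would not be.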
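Encoding categorical data converts it into the numeric form that most machine learning algorithms require. Its effect on the number of features (or predictors) depends on the scheme: assigning one integer per category keeps a single column, whereas schemes that create one column per category increase the dimensionality of the resulting dataset.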
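Whether the feature count shrinks or grows depends on the encoding scheme. The sketch below illustrates one-hot encoding, a common scheme that this tutorial has not introduced and that is used here only as an assumed example: it replaces a single column with one 0/1 column per category. The function name `one_hot_encode` and the sample data are hypothetical.

```python
def one_hot_encode(values):
    """Return the sorted category names and one 0/1 row per input value."""
    categories = sorted(set(values))
    # Each row has a 1 in the column of its own category and 0 elsewhere.
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical nominal column
colours = ["red", "green", "red", "blue"]
columns, rows = one_hot_encode(colours)

print(columns)       # ['blue', 'green', 'red']
for row in rows:
    print(row)
print(len(columns))  # 3 columns now represent what was 1 column
```

With three distinct colours, one original column becomes three, so this scheme trades a larger feature count for codes that imply no order between categories.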
With a suitable encoding scheme, the advantages of encoding categorical data in machine learning are:
-Can keep model complexity down, which helps to limit overfitting
-Improves interpretability and understandability
-Can increase predictive performance