When working with Pandas dataframes you’ll often need to convert values from one format to another. For example, you might need to convert a string to a float or an integer, or convert a datetime string to a datetime object. You might also need to create new columns of data that’s been normalised or standardised using a variety of mathematical techniques to help combat skewness.

In this post, we’ll cover how to convert Pandas column values to other formats, including float, int, and datetime, and how to create new columns of normalised or standardised data to use in your analyses or models and overcome skewness problems.

The `float` dtype refers to floating point or decimal numbers, such as 0.424. You can convert a column to a float using the `astype` method by appending `.astype(float)` to the end of the column name. If you have a number of Pandas columns that you want to cast to float, it’s easiest to create a helper function to do the job for you. The functions below assume that Pandas and NumPy have already been imported as `pd` and `np`.

```
def cols_to_float(df, columns):
    """Convert selected column values to float and return DataFrame.

    Args:
        df: Pandas DataFrame.
        columns: List of columns to convert.

    Returns:
        Original DataFrame with converted column data.
    """
    for col in columns:
        df[col] = df[col].astype(float)
    return df
```
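As a quick check, here’s a minimal usage sketch (the `price` and `qty` column names are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data: numeric values stored as strings.
df = pd.DataFrame({'price': ['9.99', '24.50'], 'qty': ['3', '7']})

for col in ['price', 'qty']:
    df[col] = df[col].astype(float)
```

If the strings may contain non-numeric junk, `pd.to_numeric(df[col], errors='coerce')` is a more forgiving alternative, converting unparseable values to `NaN` instead of raising an error.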

Integers or `int` are whole numbers, such as 42. You can convert a column to an integer using the `astype` method by appending `.astype(int)` to the end of the column name. If you have a number of Pandas columns that you want to cast to int, you can create a similar function but cast to `int` instead of `float`.

```
def cols_to_int(df, columns):
    """Convert selected column values to int and return DataFrame.

    Args:
        df: Pandas DataFrame.
        columns: List of columns to convert.

    Returns:
        Original DataFrame with converted column data.
    """
    for col in columns:
        df[col] = df[col].astype(int)
    return df
```
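One caveat worth noting: `.astype(int)` raises an error if a column contains missing values. A hedged sketch of one workaround, using Pandas’ nullable `Int64` dtype (the `orders` column name is illustrative):

```python
import pandas as pd

# Hypothetical column with a missing value: .astype(int) would raise here,
# so we parse with pd.to_numeric and cast to the nullable Int64 dtype.
df = pd.DataFrame({'orders': ['4', '12', None]})
df['orders'] = pd.to_numeric(df['orders']).astype('Int64')
```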

If you’re working with dates in Pandas, it’s important that they’re stored as the correct dtype, since the inner workings of Pandas date functions are dependent on the dtype. You can convert a column to a datetime using the `to_datetime` function and define the format of the date using the `format` parameter.

```
def cols_to_datetime(df, columns):
    """Convert selected column values to datetime and return DataFrame.

    Args:
        df: Pandas DataFrame.
        columns: List of columns to convert.

    Returns:
        Original DataFrame with converted column data.
    """
    for col in columns:
        df[col] = pd.to_datetime(df[col], format='%Y%m%d')
    return df
```
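A minimal usage sketch, assuming dates stored as `YYYYMMDD` strings (the `order_date` column name is made up):

```python
import pandas as pd

# Hypothetical dates stored as YYYYMMDD strings.
df = pd.DataFrame({'order_date': ['20220927', '20220101']})
df['order_date'] = pd.to_datetime(df['order_date'], format='%Y%m%d')
```

Once converted, the `.dt` accessor exposes components such as `df['order_date'].dt.month` for use in your analysis.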

For some analyses, it can be useful to convert positive values to negative ones. As you may recall from school, you can convert a positive value to a negative by multiplying it by -1. The below function will take the dataframe and a list of columns and will then convert each value to its negative.

```
def cols_to_negative(df, columns):
    """Convert selected column values to negative and return DataFrame.

    Args:
        df: Pandas DataFrame.
        columns: List of columns to convert.

    Returns:
        Original DataFrame with converted column data.
    """
    for col in columns:
        df[col] = df[col] * -1
    return df
```

Log values are also very useful in data science. You can convert a column to a log value using the `np.log` function. The below function will take the dataframe and a list of columns and will then convert each value to a log value.

```
def cols_to_log(df, columns):
    """Transform column data with log and return new columns of prefixed data.

    For use with data where the column values do not include zeroes.

    Args:
        df: Pandas DataFrame.
        columns: List of columns to transform.

    Returns:
        Original DataFrame with additional prefixed columns.
    """
    for col in columns:
        df['log_' + col] = np.log(df[col])
    return df
```
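A quick sanity check of what the transformation does, using made-up values chosen so the logs come out as round numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical positive values: 1, e and e squared.
df = pd.DataFrame({'sales': [1.0, np.e, np.e ** 2]})
df['log_sales'] = np.log(df['sales'])
# the natural logs of 1, e and e^2 are 0, 1 and 2
```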

The log function cannot be applied to zero values, so we add 1 to each value before applying the log function. The below function will take the dataframe and a list of columns and will then convert each value to a log+1 value.

```
def cols_to_log1p(df, columns):
    """Transform column data with log+1 and return new columns of prefixed data.

    For use with data where the column values include zeroes.

    Args:
        df: Pandas DataFrame.
        columns: List of columns to transform.

    Returns:
        Original DataFrame with additional prefixed columns.
    """
    for col in columns:
        df['log1p_' + col] = np.log(df[col] + 1)
    return df
```
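NumPy also ships a dedicated `np.log1p` function that computes log(1 + x) directly; it is equivalent to `np.log(x + 1)` and more numerically accurate for very small values:

```python
import numpy as np
import pandas as pd

# Hypothetical data containing a zero, which plain np.log cannot handle.
df = pd.DataFrame({'clicks': [0, 9, 99]})
df['log1p_clicks'] = np.log1p(df['clicks'])
```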

The log max root is another useful transformation: each value is raised to the power of one over the natural log of the column’s maximum value. It can also be used when the data contains zeroes. The below function will take the dataframe and a list of columns and will then convert each value to a log max root value.

```
def cols_to_log_max_root(df, columns):
    """Convert data points to log values using the maximum value as the log max and return new columns of prefixed data.

    For use with data where the column values include zeroes.

    Args:
        df: Pandas DataFrame.
        columns: List of columns to transform.

    Returns:
        Original DataFrame with additional prefixed columns.
    """
    for col in columns:
        log_max = np.log(df[col].max())
        df['logmr_' + col] = df[col] ** (1 / log_max)
    return df
```
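One property worth knowing: because each value x becomes x ^ (1 / ln(max)), the column maximum always maps to e (about 2.718), and zeroes stay at zero. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical data including a zero and a large maximum.
df = pd.DataFrame({'views': [0, 10, 1000]})
log_max = np.log(df['views'].max())
# the maximum itself maps to e, since max ** (1 / ln(max)) == e
df['logmr_views'] = df['views'] ** (1 / log_max)
```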

The hyperbolic tangent or `tanh` transformation is another useful transformation. The below function will take the dataframe and a list of columns and will then convert each value to a `tanh` value.

```
def cols_to_tanh(df, columns):
    """Transform column data with hyperbolic tangent and return new columns of prefixed data.

    Args:
        df: Pandas DataFrame.
        columns: List of columns to transform.

    Returns:
        Original DataFrame with additional prefixed columns.
    """
    for col in columns:
        df['tanh_' + col] = np.tanh(df[col])
    return df
```
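`tanh` squashes any input into the range -1 to 1, saturating quickly for values of large magnitude, as this small sketch with arbitrary values shows:

```python
import numpy as np
import pandas as pd

# Hypothetical values spanning a wide range.
df = pd.DataFrame({'x': [-100.0, 0.0, 0.5, 100.0]})
df['tanh_x'] = np.tanh(df['x'])
# extreme values saturate near -1 and 1; zero stays at zero
```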

There are various ways to scale data points to values between 0 and 1. The sigmoid function is one way to achieve this. The below function will take the dataframe and a list of columns and will then convert each value to a sigmoid value.

```
def cols_to_sigmoid(df, columns):
    """Convert data points to values between 0 and 1 using a sigmoid function and return new columns of prefixed data.

    Args:
        df: Pandas DataFrame.
        columns: List of columns to transform.

    Returns:
        Original DataFrame with additional prefixed columns.
    """
    for col in columns:
        df['sig_' + col] = 1 / (1 + np.exp(-df[col]))
    return df
```
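A minimal check of the sigmoid’s behaviour on made-up values: zero maps to exactly 0.5, and all outputs fall between 0 and 1:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [-5.0, 0.0, 5.0]})
# vectorised sigmoid: 1 / (1 + e^-x)
df['sig_score'] = 1 / (1 + np.exp(-df['score']))
```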

The cube root is another way to compress the range of a variable, though note that it only returns values between 0 and 1 when the input values themselves lie between 0 and 1; for larger inputs it simply reduces their magnitude. The below function will take the dataframe and a list of columns and will then convert each value to its cube root.

```
def cols_to_cube_root(df, columns):
    """Convert data points to their cube root value and return new columns of prefixed data.

    Args:
        df: Pandas DataFrame.
        columns: List of columns to transform.

    Returns:
        Original DataFrame with additional prefixed columns.
    """
    for col in columns:
        df['cube_root_' + col] = df[col] ** (1 / 3)
    return df
```

The next way of standardising values is the normalized cube root, which will also return values between 0 and 1. The below function will take the dataframe and a list of columns and will then convert each value to a normalized cube root value by determining the `min()` and `max()` values for each column.

```
def cols_to_cube_root_normalize(df, columns):
    """Convert data points to their normalized cube root value so all values are between 0-1 and return new columns of prefixed data.

    Args:
        df: Pandas DataFrame.
        columns: List of columns to transform.

    Returns:
        Original DataFrame with additional prefixed columns.
    """
    for col in columns:
        normalized = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
        df['cube_root_norm_' + col] = normalized ** (1 / 3)
    return df
```
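The intended behaviour is easy to verify on made-up data: the column minimum should map to 0 and the maximum to 1, with the cube root applied to the whole normalized value (the `v` column name and `cube_root_norm_` prefix are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'v': [2.0, 10.0, 66.0]})
# normalise to the 0-1 range first, then take the cube root
normalized = (df['v'] - df['v'].min()) / (df['v'].max() - df['v'].min())
df['cube_root_norm_v'] = normalized ** (1 / 3)
```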

Normalization is perhaps the most common way to scale data to values between 0 and 1. The below function will take the dataframe and a list of columns and will then convert each value to a normalized value by determining the `min()` and `max()` values for each column.

```
def cols_to_normalize(df, columns):
    """Convert data points to values between 0 and 1 and return new columns of prefixed data.

    Args:
        df: Pandas DataFrame.
        columns: List of columns to transform.

    Returns:
        Original DataFrame with additional prefixed columns.
    """
    for col in columns:
        df['norm_' + col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
    return df
```
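A minimal sketch of min-max normalization on made-up data (the `spend` column name is hypothetical); the smallest value maps to 0 and the largest to 1:

```python
import pandas as pd

df = pd.DataFrame({'spend': [5.0, 20.0, 85.0]})
df['norm_spend'] = (df['spend'] - df['spend'].min()) / (df['spend'].max() - df['spend'].min())
```

If you are already using scikit-learn, its `MinMaxScaler` performs the same rescaling and can store the fitted min and max for reuse on new data.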

When zeroes are present in the columns, you can also use normalization with a log+1 transformation. Again, this will return values between 0 and 1. The below function will take the dataframe and a list of columns and will then convert each value to a log+1 normalized value by determining the `min()` and `max()` values for each column.

```
def cols_to_log1p_normalize(df, columns):
    """Transform column data with log+1 normalized and return new columns of prefixed data.

    For use with data where the column values include zeroes.

    Args:
        df: Pandas DataFrame.
        columns: List of columns to transform.

    Returns:
        Original DataFrame with additional prefixed columns.
    """
    for col in columns:
        df['log1p_norm_' + col] = np.log((df[col] - df[col].min()) / (df[col].max() - df[col].min()) + 1)
    return df
```

Percentile linearization will rank each data point by its percentile. To do this we can use the Pandas `rank()` function and a `lambda` function. The below function will take the dataframe and a list of columns and will then convert each value to a percentile linearized value.

```
def cols_to_percentile(df, columns):
    """Convert data points to their percentile linearized value and return new columns of prefixed data.

    Args:
        df: Pandas DataFrame.
        columns: List of columns to transform.

    Returns:
        Original DataFrame with additional prefixed columns.
    """
    for col in columns:
        df['pc_lin_' + col] = df[col].rank(method='min').apply(lambda x: (x - 1) / (len(df[col]) - 1))
    return df
```
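A small sketch on made-up values shows the idea: the values are ranked, then the ranks are rescaled so the smallest value maps to 0 and the largest to 1:

```python
import pandas as pd

df = pd.DataFrame({'v': [30, 10, 20, 40]})
# rank the values, then rescale the ranks so they run from 0 to 1
df['pc_lin_v'] = df['v'].rank(method='min').apply(lambda x: (x - 1) / (len(df) - 1))
```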

Matt Clarke, Tuesday, September 27, 2022