How to transform numeric Pandas dataframe column values

Learn how to transform Pandas column values into other formats including float, int, datetime, and create new columns of data that's been normalised or standardised using a variety of mathematical techniques.

When working with Pandas dataframes you’ll often need to convert values from one format to another. For example, you might need to convert a string to a float or an integer, or convert a datetime string to a datetime object. You might also need to create new columns of data that’s been normalised or standardised using a variety of mathematical techniques to help combat skewness.

In this post, we’ll cover how to convert Pandas column values into other formats including float, int, datetime, and create new columns of data that’s been normalised or standardised, so you can use it in your analysis or models and overcome skewness problems.
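
Throughout this post, the examples assume Pandas and NumPy have been imported as pd and np. If you want to follow along, you can create a small dataframe of dummy data like the one below; the column names and values are made up purely for illustration.

import pandas as pd
import numpy as np

# Dummy example data: numbers and dates stored as strings.
df = pd.DataFrame({
    'price': ['19.99', '4.50', '120.00'],
    'quantity': ['1', '3', '2'],
    'order_date': ['20220901', '20220915', '20220927'],
})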

Convert Pandas column values to float

The float dtype refers to floating point or decimal numbers, such as 0.424. You can convert a column to a float using the astype method by calling .astype(float) on the column. If you have a number of Pandas columns that you want to cast to float, it’s easiest to create a helper function to do the job for you.

def cols_to_float(df, columns):
    """Convert selected column values to float and return DataFrame.
    Args:
        df: Pandas DataFrame.
        columns: List of columns to convert.
    Returns:
        Original DataFrame with converted column data.
    """

    for col in columns:
        df[col] = df[col].astype(float)

    return df
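
Using the dummy dataframe above, you might call the helper like this and check the dtypes to confirm the conversion worked.

df = cols_to_float(df, ['price', 'quantity'])
print(df.dtypes)  # price and quantity are now float64, order_date is still object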

Convert Pandas column values to int

Integers or int are whole numbers, such as 42. You can convert a column to an integer using the astype method by calling .astype(int) on the column. If you have a number of Pandas columns that you want to cast to int, you can create a similar function but cast to int instead of float.

def cols_to_int(df, columns):
    """Convert selected column values to int and return DataFrame.
    Args:
        df: Pandas DataFrame.
        columns: List of columns to convert.
    Returns:
        Original DataFrame with converted column data.
    """

    for col in columns:
        df[col] = df[col].astype(int)

    return df
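
One thing to watch out for: astype(int) will raise an error if a column contains missing values, because NaN can’t be represented by a regular int. If that’s a possibility in your data, one option is Pandas’ nullable Int64 dtype, sketched below on a made-up series.

# Hypothetical series containing a missing value.
s = pd.Series([1.0, 2.0, None])

# s.astype(int) would raise an error here because of the NaN.
# The nullable Int64 dtype keeps the missing value as <NA> instead.
print(s.astype('Int64'))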

Convert Pandas column values to datetime

If you’re working with dates in Pandas, it’s important that they’re stored as the correct dtype, since the inner workings of Pandas date functions are dependent on the dtype. You can convert a column to a datetime using the to_datetime function and define the format of the date using the format parameter. The helper below hard-codes the %Y%m%d format, so adjust the format string to match how dates are stored in your data.

def cols_to_datetime(df, columns):
    """Convert selected column values to datetime and return DataFrame.
    Args:
        df: Pandas DataFrame.
        columns: List of columns to convert.
    Returns:
        Original DataFrame with converted column data.
    """

    for col in columns:
        df[col] = pd.to_datetime(df[col], format='%Y%m%d')

    return df
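
For example, assuming your dates are stored as strings in %Y%m%d format, like the order_date column in the dummy dataframe above, the conversion lets you use the .dt accessor afterwards.

df = cols_to_datetime(df, ['order_date'])
print(df['order_date'].dt.month)  # works now the column is datetime64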

Convert Pandas column values to negatives

For some analyses, it can be useful to convert positive values to negative. As you may recall from school, you can convert a positive value to a negative by multiplying it by -1. The below function will take the dataframe and a list of columns and will then convert each value to a negative.

def cols_to_negative(df, columns):
    """Convert selected column values to negative and return DataFrame.
    Args:
        df: Pandas DataFrame.
        columns: List of columns to convert.
    Returns:
        Original DataFrame with converted column data.
    """

    for col in columns:
        df[col] = df[col] * -1

    return df

Convert Pandas column values to log

Log transformations are also very useful in data science, as they can help to reduce skewness in a column. You can convert a column to its natural log value using the np.log function. The below function will take the dataframe and a list of columns and will then convert each value to a log value.

def cols_to_log(df, columns):
    """Transform column data with log and return new columns of prefixed data.
    For use with data where the column values do not include zeroes.
    Args:
        df: Pandas DataFrame.
        columns: List of columns to transform.
    Returns:
        Original DataFrame with additional prefixed columns.
    """

    for col in columns:
        df['log_' + col] = np.log(df[col])

    return df
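
One caveat: the natural log is only defined for positive values, so any zeroes in the column will come back as -inf (along with a divide-by-zero warning from NumPy). That’s the problem the log+1 approach in the next section is designed to avoid. A quick illustration on a made-up series:

s = pd.Series([0.0, 1.0, 10.0])
print(np.log(s))  # the zero becomes -inf and triggers a RuntimeWarning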

Convert Pandas column values to log+1

The log function cannot be applied to zero values, so we add 1 to each value before applying the log function. The below function will take the dataframe and a list of columns and will then convert each value to a log+1 value.

def cols_to_log1p(df, columns):
    """Transform column data with log+1 and return new columns of prefixed data.
    For use with data where the column values include zeroes.
    Args:
        df: Pandas DataFrame.
        columns: List of columns to transform.
    Returns:
        Original DataFrame with additional prefixed columns.
    """

    for col in columns:
        df['log1p_' + col] = np.log(df[col] + 1)

    return df
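
NumPy also includes np.log1p(), which calculates log(1 + x) directly and is slightly more accurate for values close to zero, so you could equally write the transformation like this (shown here on the hypothetical price column from earlier).

# Equivalent to np.log(df['price'] + 1), but more numerically stable near zero.
df['log1p_price'] = np.log1p(df['price'])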

Convert Pandas column values to log max root

The log max root is another useful transformation for taming large values. It raises each value to the power of 1 divided by the natural log of the column’s maximum, which rescales the column so the maximum value maps to e (roughly 2.718) while zeroes stay at zero, so it can be used when the data contains zeroes. Note that it assumes the column maximum is greater than 1. The below function will take the dataframe and a list of columns and will then convert each value to a log max root value.

def cols_to_log_max_root(df, columns):
    """Convert data points to log values using the maximum value as the log max and return new columns of prefixed data.
    For use with data where the column values include zeroes.
    Args:
        df: Pandas DataFrame.
        columns: List of columns to transform.
    Returns:
        Original DataFrame with additional prefixed columns.
    """

    for col in columns:
        log_max = np.log(df[col].max())
        df['logmr_' + col] = df[col] ** (1 / log_max)

    return df
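
A quick check on some made-up numbers confirms this behaviour:

s = pd.Series([0.0, 10.0, 1000.0])
log_max = np.log(s.max())   # ln(1000) is roughly 6.91
print(s ** (1 / log_max))   # roughly [0.00, 1.40, 2.72], so the maximum maps to e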

Convert Pandas column values to their hyperbolic tangent or tanh

The hyperbolic tangent or tanh transformation is another useful one: it squashes each value into the range -1 to 1. The below function will take the dataframe and a list of columns and will then convert each value to a tanh value.

def cols_to_tanh(df, columns):
    """Transform column data with hyperbolic tangent and return new columns of prefixed data.
    Args:
        df: Pandas DataFrame.
        columns: List of columns to transform.
    Returns:
        Original DataFrame with additional prefixed columns.
    """

    for col in columns:
        df['tanh_' + col] = np.tanh(df[col])

    return df
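
For example, on a handful of made-up values:

s = pd.Series([-5.0, 0.0, 0.5, 5.0])
print(np.tanh(s))  # approximately [-0.9999, 0.0, 0.4621, 0.9999]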

Convert Pandas column values to 0 to 1 using sigmoid

There are various ways to scale data points to values between 0 and 1. The sigmoid function is one way to achieve this. The below function will take the dataframe and a list of columns and will then convert each value to a sigmoid value.

def cols_to_sigmoid(df, columns):
    """Convert data points to values between 0 and 1 using a sigmoid function and return new columns of prefixed data.
    Args:
        df: Pandas DataFrame.
        columns: List of columns to transform.
    Returns:
        Original DataFrame with additional prefixed columns.
    """

    for col in columns:
        e = np.exp(1)
        y = 1 / (1 + e ** (-df[col]))
        df['sig_' + col] = y

    return df
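
Unlike min-max normalization, the sigmoid doesn’t need to know the column’s minimum and maximum values: negative values map below 0.5 and positive values map above 0.5. For example, on some made-up values:

s = pd.Series([-3.0, 0.0, 3.0])
print(1 / (1 + np.exp(-s)))  # approximately [0.047, 0.500, 0.953]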

Convert Pandas column values to 0 to 1 using cube root

The cube root will also scale down data points, though strictly it only returns values between 0 and 1 when the original values are themselves between 0 and 1; larger values are compressed rather than capped. The below function will take the dataframe and a list of columns and will then convert each value to its cube root value.

def cols_to_cube_root(df, columns):
    """Convert data points to their cube root value so all values are between 0-1 and return new columns of prefixed data.
    Args:
        df: Pandas dataframe.
        columns: List of columns to transform.
    Returns:
        Original dataframe with additional prefixed columns.
    """

    for col in columns:
        df['cube_root_' + col] = df[col] ** (1 / 3)

    return df

Convert Pandas column values to 0 to 1 using normalized cube root

The next way of standardising values is the normalized cube root, which will also return values between 0 and 1. The below function will take the dataframe and a list of columns and will then convert each value to a normalized cube root value by determining the min() and max() values for each column.

def cols_to_cube_root_normalize(df, columns):
    """Convert data points to their normalized cube root value so all values are between 0-1 and return new columns of prefixed data.
    Args:
        df: Pandas DataFrame.
        columns: List of columns to transform.
    Returns:
        Original DataFrame with additional prefixed columns.
    """

    for col in columns:
        df['cube_root_norm_' + col] = ((df[col] - df[col].min()) / (df[col].max() - df[col].min())) ** (1 / 3)

    return df

Convert Pandas column values to 0 to 1 using normalization

Normalization is perhaps the most common way to scale data to values between 0 and 1. The below function will take the dataframe and a list of columns and will then convert each value to a normalized value by determining the min() and max() values for each column.

def cols_to_normalize(df, columns):
    """Convert data points to values between 0 and 1 and return new columns of prefixed data.
    Args:
        df: Pandas DataFrame.
        columns: List of columns to transform.
    Returns:
        Original DataFrame with additional prefixed columns.
    """

    for col in columns:
        df['norm_' + col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

    return df
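
With min-max normalization, the smallest value in each column becomes 0 and the largest becomes 1. Using the hypothetical price column from earlier:

df = cols_to_normalize(df, ['price'])
print(df[['price', 'norm_price']])  # the lowest price maps to 0.0 and the highest to 1.0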

Convert Pandas column values to log+1 normalized

When zeroes are present in the columns, you can also use normalization with a log+1 transformation. Again, this keeps the values between 0 and 1 (strictly, between 0 and the natural log of 2, which is roughly 0.69). The below function will take the dataframe and a list of columns and will then convert each value to a log+1 normalized value by determining the min() and max() values for each column.

def cols_to_log1p_normalize(df, columns):
    """Transform column data with log+1 normalized and return new columns of prefixed data.
    For use with data where the column values include zeroes.
    Args:
        df: Pandas DataFrame.
        columns: List of columns to transform.
    Returns:
        Original DataFrame with additional prefixed columns.
    """

    for col in columns:
        df['log1p_norm_' + col] = np.log((df[col] - df[col].min()) / (df[col].max() - df[col].min()) + 1)

    return df

Convert Pandas column values to their percentile linearized value

Percentile linearization will rank each data point by its percentile. To do this we can use the Pandas rank() function and a lambda function. The below function will take the dataframe and a list of columns and will then convert each value to a percentile linearized value.

def cols_to_percentile(df, columns):
    """Convert data points to their percentile linearized value and return new columns of prefixed data.
    Args:
        df: Pandas DataFrame.
        columns: List of columns to transform.
    Returns:
        Original DataFrame with additional prefixed columns.
    """

    for col in columns:
        df['pc_lin_' + col] = df[col].rank(method='min').apply(lambda x: (x - 1) / (len(df[col]) - 1))

    return df
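
Assuming the column values are distinct, the smallest value receives a score of 0 and the largest a score of 1. A quick check on some made-up sales figures:

test = pd.DataFrame({'sales': [10, 50, 20, 80, 40]})
test = cols_to_percentile(test, ['sales'])
print(test['pc_lin_sales'].tolist())  # [0.0, 0.75, 0.25, 1.0, 0.5]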

Matt Clarke, Tuesday, September 27, 2022

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.