How to unserialize serialized PHP arrays using Python

PHP serialized arrays and objects are common in ecommerce database schemas. This is how you unserialize them in Python.

How to unserialize serialized PHP arrays using Python
Picture by Luca Bravo, Unsplash.
7 minutes to read

If you regularly work with ecommerce data, you’re likely to have encountered PHP serialized arrays or objects. Serialization is a process used to take a complex data structure, such as a PHP array, and convert it to a format that can be neatly stored in a single database column, and then quickly retrieved and deserialized to obtain the original array.

Serialization is a popular technique for PHP developers, because it prevents the need for them to create complex database schemas, and update them every time a new parameter is added somewhere. The downside is that serialized data is a monumental pain in the arse for data scientists, because it can’t be queried or searched like regular columnar data.

In this project, I’ll show you how you can unserialize serialized PHP arrays using Python and convert them into regular Python dictionaries that you can parse and manipulate. We’ll look at some real ecommerce data from a PHP-based ecommerce platform and unserialize the data and convert the serialized data to new columns in a Pandas dataframe for easier analysis.

Load the packages

First, open up a Jupyter notebook and import pandas and phpserialize. This Python package allows you to serialize data so it’s in the same format that PHP uses, and deserialize serialized data created by PHP’s serialize() function. To install phpserialize enter pip3 install phpserialize in your terminal.

import pandas as pd
from phpserialize import serialize
from phpserialize import unserialize
from phpserialize import phpobject
pd.set_option('max_colwidth', 300)

Load the data

Next, load up your data containing your PHP serialized arrays. The data I’m working with are from an ecommerce platform and record shipment processing data.

df = pd.read_csv('logs.csv')
df = df[['id', 'data']]
df.head(3)
id data
0 69807 a:5:{s:9:"processor";s:3:"ABC";s:15:"shipment_number";s:9:"AA2994755";s:15:"request_from_ip";s:14:"91.152.190.166";s:21:"request_from_location";s:8:"Internal";s:7:"message";s:70:"Created a new shipment AA2994755 to be collected by ABC.";}
1 69804 a:5:{s:9:"processor";s:3:"ABC";s:15:"shipment_number";s:9:"AA2994741";s:15:"request_from_ip";s:14:"91.152.190.166";s:21:"request_from_location";s:8:"Internal";s:7:"message";s:70:"Created a new shipment AA2994741 to be collected by ABC.";}
2 69801 a:5:{s:9:"processor";s:3:"ABC";s:15:"shipment_number";s:9:"AA2994738";s:15:"request_from_ip";s:14:"91.152.190.166";s:21:"request_from_location";s:8:"Internal";s:7:"message";s:70:"Created a new shipment AA2994738 to be collected by ABC.";}
df.shape
(100, 2)

Unserialize serialized PHP arrays

Unserializing PHP arrays and turning them into Python dictionaries is made much simpler via the phpserialize package. However, there are still some hurdles to overcome. To smooth out the process, I’ve written a little function to handle the important steps.

This will take a PHP serialized array and byte encode it, then convert the byte encoded dictionary (identified by the b' character before each key or value) into a regular Python dictionary.

def php_serialized_to_dict(serialized):

    dict_bytes = unserialize(bytes(serialized, 'utf-8'), object_hook=phpobject)

    dict_regular = {
        key.decode(): val.decode() if isinstance(val, bytes) else val
        for key, val in dict_bytes.items()
    }

    return dict_regular
data = php_serialized_to_dict(df.data[0])
data
{'processor': 'ABC',
 'shipment_number': 'AA2994755',
 'request_from_ip': '91.152.190.166',
 'request_from_location': 'Internal',
 'message': 'Created a new shipment <b>AA2994755</b> to be collected by <b>ABC</b>.'}

Unserialize the serialized dataframe columns

To go through our Pandas dataframe and unserialize every serialized array in the data column we can use the apply() function and lambda to call our php_serialized_to_dict() function and return the unserialized dictionary to a new dataframe column called data_unserialized.

df['data_unserialized'] = df.apply(lambda x: php_serialized_to_dict(x.data), axis=1)
df.head()
id data data_unserialized
0 69807 a:5:{s:9:"processor";s:3:"ABC";s:15:"shipment_number";s:9:"AA2994755";s:15:"request_from_ip";s:14:"91.152.190.166";s:21:"request_from_location";s:8:"Internal";s:7:"message";s:70:"Created a new shipment AA2994755 to be collected by ABC.";} {'processor': 'ABC', 'shipment_number': 'AA2994755', 'request_from_ip': '91.152.190.166', 'request_from_location': 'Internal', 'message': 'Created a new shipment AA2994755 to be collected by ABC.'}
1 69804 a:5:{s:9:"processor";s:3:"ABC";s:15:"shipment_number";s:9:"AA2994741";s:15:"request_from_ip";s:14:"91.152.190.166";s:21:"request_from_location";s:8:"Internal";s:7:"message";s:70:"Created a new shipment AA2994741 to be collected by ABC.";} {'processor': 'ABC', 'shipment_number': 'AA2994741', 'request_from_ip': '91.152.190.166', 'request_from_location': 'Internal', 'message': 'Created a new shipment AA2994741 to be collected by ABC.'}
2 69801 a:5:{s:9:"processor";s:3:"ABC";s:15:"shipment_number";s:9:"AA2994738";s:15:"request_from_ip";s:14:"91.152.190.166";s:21:"request_from_location";s:8:"Internal";s:7:"message";s:70:"Created a new shipment AA2994738 to be collected by ABC.";} {'processor': 'ABC', 'shipment_number': 'AA2994738', 'request_from_ip': '82.152.190.166', 'request_from_location': 'Internal', 'message': 'Created a new shipment AA2994738 to be collected by ABC.'}
3 69798 a:5:{s:9:"processor";s:3:"ABC";s:15:"shipment_number";s:9:"AA2994724";s:15:"request_from_ip";s:14:"91.152.190.166";s:21:"request_from_location";s:8:"Internal";s:7:"message";s:70:"Created a new shipment AA2994724 to be collected by ABC.";} {'processor': 'ABC', 'shipment_number': 'AA2994724', 'request_from_ip': '82.152.190.166', 'request_from_location': 'Internal', 'message': 'Created a new shipment AA2994724 to be collected by ABC.'}
4 69795 a:5:{s:9:"processor";s:3:"ABC";s:15:"shipment_number";s:9:"AA2994715";s:15:"request_from_ip";s:14:"91.152.190.166";s:21:"request_from_location";s:8:"Internal";s:7:"message";s:70:"Created a new shipment AA2994715 to be collected by ABC.";} {'processor': 'ABC', 'shipment_number': 'AA2994715', 'request_from_ip': '91.152.190.166', 'request_from_location': 'Internal', 'message': 'Created a new shipment AA2994715 to be collected by ABC.'}

Split the dictionary into separate columns

The above is OK. We’d be able to grab each dictionary stored in the data_serialized column and examine or parse it. However, it’s just as clunky as serialized data, so we’ll now create a new dataframe in which each of the dictionary keys is turned into a new column in the dataframe.

This can be done very quickly using the json_normalize() function. It’s extremely useful for stuff like this, and it gives us a Pandas dataframe containing all the data in columns we can search, filter, and subset with ease.

df_unserialized = pd.json_normalize(df['data_unserialized'])
df_unserialized.head()
processor shipment_number request_from_ip request_from_location message
0 ABC AA2994755 91.152.190.166 Internal Created a new shipment AA2994755 to be collected by ABC.
1 ABC AA2994741 91.152.190.166 Internal Created a new shipment AA2994741 to be collected by ABC.
2 ABC AA2994738 91.152.190.166 Internal Created a new shipment AA2994738 to be collected by ABC.
3 ABC AA2994724 91.152.190.166 Internal Created a new shipment AA2994724 to be collected by ABC.
4 ABC AA2994715 91.152.190.166 Internal Created a new shipment AA2994715 to be collected by ABC.

Matt Clarke, Friday, March 12, 2021

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.