If you regularly work with ecommerce data, you’re likely to have encountered PHP serialized arrays or objects. Serialization is a process used to take a complex data structure, such as a PHP array, and convert it to a format that can be neatly stored in a single database column, and then quickly retrieved and deserialized to obtain the original array.
Serialization is a popular technique for PHP developers, because it prevents the need for them to create complex database schemas, and update them every time a new parameter is added somewhere. The downside is that serialized data is a monumental pain in the arse for data scientists, because it can’t be queried or searched like regular columnar data.
In this project, I’ll show you how you can unserialize serialized PHP arrays using Python and convert them into regular Python dictionaries that you can parse and manipulate. We’ll look at some real ecommerce data from a PHP-based ecommerce platform and unserialize the data and convert the serialized data to new columns in a Pandas dataframe for easier analysis.
First, open up a Jupyter notebook and import pandas
and phpserialize
. This Python package allows you to serialize
data so it’s in the same format that PHP uses, and deserialize
serialized data created by PHP’s serialize()
function. To install phpserialize
enter pip3 install phpserialize
in your terminal.
import pandas as pd
from phpserialize import serialize
from phpserialize import unserialize
from phpserialize import phpobject
pd.set_option('max_colwidth', 300)
Next, load up your data containing your PHP serialized arrays. The data I’m working with are from an ecommerce platform and record shipment processing data.
df = pd.read_csv('logs.csv')
df = df[['id', 'data']]
df.head(3)
id | data | |
---|---|---|
0 | 69807 | a:5:{s:9:"processor";s:3:"ABC";s:15:"shipment_number";s:9:"AA2994755";s:15:"request_from_ip";s:14:"91.152.190.166";s:21:"request_from_location";s:8:"Internal";s:7:"message";s:70:"Created a new shipment AA2994755 to be collected by ABC.";} |
1 | 69804 | a:5:{s:9:"processor";s:3:"ABC";s:15:"shipment_number";s:9:"AA2994741";s:15:"request_from_ip";s:14:"91.152.190.166";s:21:"request_from_location";s:8:"Internal";s:7:"message";s:70:"Created a new shipment AA2994741 to be collected by ABC.";} |
2 | 69801 | a:5:{s:9:"processor";s:3:"ABC";s:15:"shipment_number";s:9:"AA2994738";s:15:"request_from_ip";s:14:"91.152.190.166";s:21:"request_from_location";s:8:"Internal";s:7:"message";s:70:"Created a new shipment AA2994738 to be collected by ABC.";} |
df.shape
(100, 2)
Unserializing PHP arrays and turning them into Python dictionaries is made much simpler via the phpserialize
package. However, there are still some hurdles to overcome. To smooth out the process, I’ve written a little function to handle the important steps.
This will take a PHP serialized array and byte encode it, then convert the byte encoded dictionary (identified by the b'
character before each key or value) into a regular Python dictionary.
def php_serialized_to_dict(serialized):
dict_bytes = unserialize(bytes(serialized, 'utf-8'), object_hook=phpobject)
dict_regular = {
key.decode(): val.decode() if isinstance(val, bytes) else val
for key, val in dict_bytes.items()
}
return dict_regular
data = php_serialized_to_dict(df.data[0])
data
{'processor': 'ABC',
'shipment_number': 'AA2994755',
'request_from_ip': '91.152.190.166',
'request_from_location': 'Internal',
'message': 'Created a new shipment <b>AA2994755</b> to be collected by <b>ABC</b>.'}
To go through our Pandas dataframe and unserialize every serialized array in the data
column we can use the apply()
function and lambda
to call our php_serialized_to_dict()
function and return the unserialized dictionary to a new dataframe column called data_unserialized
.
df['data_unserialized'] = df.apply(lambda x: php_serialized_to_dict(x.data), axis=1)
df.head()
id | data | data_unserialized | |
---|---|---|---|
0 | 69807 | a:5:{s:9:"processor";s:3:"ABC";s:15:"shipment_number";s:9:"AA2994755";s:15:"request_from_ip";s:14:"91.152.190.166";s:21:"request_from_location";s:8:"Internal";s:7:"message";s:70:"Created a new shipment AA2994755 to be collected by ABC.";} | {'processor': 'ABC', 'shipment_number': 'AA2994755', 'request_from_ip': '91.152.190.166', 'request_from_location': 'Internal', 'message': 'Created a new shipment AA2994755 to be collected by ABC.'} |
1 | 69804 | a:5:{s:9:"processor";s:3:"ABC";s:15:"shipment_number";s:9:"AA2994741";s:15:"request_from_ip";s:14:"91.152.190.166";s:21:"request_from_location";s:8:"Internal";s:7:"message";s:70:"Created a new shipment AA2994741 to be collected by ABC.";} | {'processor': 'ABC', 'shipment_number': 'AA2994741', 'request_from_ip': '91.152.190.166', 'request_from_location': 'Internal', 'message': 'Created a new shipment AA2994741 to be collected by ABC.'} |
2 | 69801 | a:5:{s:9:"processor";s:3:"ABC";s:15:"shipment_number";s:9:"AA2994738";s:15:"request_from_ip";s:14:"91.152.190.166";s:21:"request_from_location";s:8:"Internal";s:7:"message";s:70:"Created a new shipment AA2994738 to be collected by ABC.";} | {'processor': 'ABC', 'shipment_number': 'AA2994738', 'request_from_ip': '82.152.190.166', 'request_from_location': 'Internal', 'message': 'Created a new shipment AA2994738 to be collected by ABC.'} |
3 | 69798 | a:5:{s:9:"processor";s:3:"ABC";s:15:"shipment_number";s:9:"AA2994724";s:15:"request_from_ip";s:14:"91.152.190.166";s:21:"request_from_location";s:8:"Internal";s:7:"message";s:70:"Created a new shipment AA2994724 to be collected by ABC.";} | {'processor': 'ABC', 'shipment_number': 'AA2994724', 'request_from_ip': '82.152.190.166', 'request_from_location': 'Internal', 'message': 'Created a new shipment AA2994724 to be collected by ABC.'} |
4 | 69795 | a:5:{s:9:"processor";s:3:"ABC";s:15:"shipment_number";s:9:"AA2994715";s:15:"request_from_ip";s:14:"91.152.190.166";s:21:"request_from_location";s:8:"Internal";s:7:"message";s:70:"Created a new shipment AA2994715 to be collected by ABC.";} | {'processor': 'ABC', 'shipment_number': 'AA2994715', 'request_from_ip': '91.152.190.166', 'request_from_location': 'Internal', 'message': 'Created a new shipment AA2994715 to be collected by ABC.'} |
The above is OK. We’d be able to grab each dictionary stored in the data_serialized
column and examine or parse it. However, it’s just as clunky as serialized data, so we’ll now create a new dataframe in which each of the dictionary keys is turned into a new column in the dataframe.
This can be done very quickly using the json_normalize()
function. It’s extremely useful for stuff like this, and it gives us a Pandas dataframe containing all the data in columns we can search, filter, and subset with ease.
df_unserialized = pd.json_normalize(df['data_unserialized'])
df_unserialized.head()
processor | shipment_number | request_from_ip | request_from_location | message | |
---|---|---|---|---|---|
0 | ABC | AA2994755 | 91.152.190.166 | Internal | Created a new shipment AA2994755 to be collected by ABC. |
1 | ABC | AA2994741 | 91.152.190.166 | Internal | Created a new shipment AA2994741 to be collected by ABC. |
2 | ABC | AA2994738 | 91.152.190.166 | Internal | Created a new shipment AA2994738 to be collected by ABC. |
3 | ABC | AA2994724 | 91.152.190.166 | Internal | Created a new shipment AA2994724 to be collected by ABC. |
4 | ABC | AA2994715 | 91.152.190.166 | Internal | Created a new shipment AA2994715 to be collected by ABC. |
Matt Clarke, Friday, March 12, 2021