Automating Recalculation of Excel Files in Python And Databricks

On a weekly basis, we exchange large amounts of data with our clients. This can for instance be information on different SKUs, such as how long they have been in stock, how they are selling, and so on.  

 

For our customers, it may be easier to receive this data in Excel, as excel is a tool a lot of people are familiar with. Most of us already have it installed on our computer, which means no additional solutions or licenses have to be bought. A benefit Excel has over CSV is that it has a user interface and you do not need to spend a lot time developing. On top of this, we can also use different styles of formatting, colors and visualizations in Excel, so that it looks better and makes the data easier to understand for our clients 

 

 

Taking into account that we want to build some kind of a product, we actually need to automate the process of Excel file creation, calculation of the formulas and saving the files on the client side. As for some customers there are lots of different departmens receiving these types of files, we need to automate this process of using Excel and VBA (the programming language in Excel), because doing it manually would take too long. The automation would entail the process of Excel file creation, calculation of the formulas and saving the files on the client side. 

 

Challenges

At aioneers we use Databricks for the automation of data science processes, and therefore we wanted to do the automation of Excel file creation on Databricks.  

Databricks, however, has some additional challenges that are specific to the platform. The first thing that was immediately evident was that it is on Linux. This means that we don‘t have any version of Excel on the platform, because Excel does not run on Linux. However, there is a possibility to use open source software that we can use to calculate the Excel formulas in the Excel sheets. As we worked on this project, we came across some challenges that we did not expect at the start. The code we’re sharing in this article is only relevant for the working solution that we found. The code that will be written here will only be relevant to the solution that works. 

 

How to make it work

At first we followed the instructions to install Open Office on a Databricks server. However, we soon figured out that OpenOffice uses Python 2 instead of Python 3Python 3 is the current standard, and we did not want to use an old and unsupported version of Python. This made us switch to LibreOffice. 

 

All the code listed below is written and executed in Databricks notebooks. 

 

Installation of LibreOffice: 

 

%sh

sudo apt-get install libreoffice -y

 

In order to start LibreOffice from Python we needed to install additional package python3-uno: 

 

%sh 

sudo apt-get --yes --force-yes install  python3-uno

 

After that we checked what packages we have installed: 

 

%sh

dpkg -l python3-uno libreoffice libreoffice-calc o

 

This command confirmed that we have all the needed packages installed. The output of the command: 

 

||/ Name             Version                   Architecture Description 

+++-================-=========================-============-======================================== 

ii  libreoffice      1:6.0.7-0ubuntu0.18.04.10 amd64        office productivity suite (metapackage) 

ii  libreoffice-calc 1:6.0.7-0ubuntu0.18.04.10 amd64        office productivity suite -- spreadsheet 

ii  python3-uno      1:6.0.7-0ubuntu0.18.04.10 amd64        Python-UNO bridge 

 

According to the manual, after installing python3-uno packages, the uno package should be available in Python. But executing import uno tells us that the library is not found.  

Why did it happen? It appears that there are two different Python installations on the system. One is the one that is used by Databricks (/databricks/python/bin/python) and the other one, installed by the system by default (/usr/bin/python3). The installation of python3-uno package only installs the uno package for /usr/bin/python3, and not /databricks/python/bin/python that is actually used on Databricks.

 

After thinking about this for a while, we decided to create a Python script file and execute it with the right Python (/usr/bin/python3). The Idea of the following script is to open the Excel file, recalculate all the formulas and close it. 

 

We start this script by defining our variables and imports: 

 

temp_file = "temp.xlsx" 

py_file_string = """ 

temp_file = '""" + temp_file + """' 

inputFile = temp_file 

 

import time 

import os 

import uno 

from com.sun.star.beans import PropertyValuecode> 

 

import subprocess 

 

Then we start LibreOffice and wait 3 seconds for it to be up and running: 

 

os.popen('/usr/lib/libreoffice/program/soffice --headless --accept="socket,host=localhost,port=2002;urp;StarOffice.ServiceManager"') 

time.sleep(3)  # Sleep for 3 seconds 

 

Then we need to get the desktop class. Contrary to what is written in most manuals for LibreOffice, desktop should be assigned from a Service Manager. 

 

# get the uno component context from the PyUNO runtime 

localContext = uno.getComponentContext() 

 

# create the UnoUrlResolver 

resolver = localContext.ServiceManager.createInstanceWithContext( 

    "com.sun.star.bridge.UnoUrlResolver", localContext) 

 

# connect to the running office 

context = resolver.resolve("uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext") 

smgr = context.ServiceManager 

desktop = smgr.createInstanceWithContext( "com.sun.star.frame.Desktop",context) 

 

Now we need to create the URL that LibreOffice understands from the filename that we have: 

 

def urlify(path): 

    return uno.systemPathToFileUrl(os.path.realpath(path)) 

 

thedocurl = urlify(temp_file) 

 

# the new filename URL 

newdocurl = urlify(temp_file[:-5]+"_new.xlsx") 

 

print(thedocurl) 

 

Now open the document from the URL: 

 

document = desktop.loadComponentFromURL(thedocurl,"_blank",0,()) 

 

Recalculate all the formulas: 

 

document.calculateAll() 

Store as Excel .xlsx format: 

p = PropertyValue() 

p.Name = 'FilterName' 

p.Value = 'Calc Office Open XML' 

document.storeAsURL(newdocurl, (p,)) 

 

Close the document in LibreOffice: 

 

document.dispose() 

 

After executing all of the lines above the Python in Databricks froze and never finished executing. The solution to this problem was to kill the process of LibreOffice. This was the final step that finally released Python from an endless loop: 

 

subprocess.run(['/usr/bin/killall', 'soffice.bin'], stdout=subprocess.PIPE).stdout.decode('utf-8') 

 

Finishing the code with ending the multiline string and running the process: 

 

""" 

 

with open("init.py", "w") as f: 

    f.write(py_file_string) 

     

# function to run and print output from shell 

def run_process_func(exe): 

    p = subprocess.Popen(exe, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) 

    while True: 

        # returns None while subprocess is running 

        retcode = p.poll() 

        line = p.stdout.readline().decode("utf8") 

        yield line 

        if retcode is not None: 

            break 

 

def run_process(exe): 

    print(f"running cmd: {exe}") 

    for line in run_process_func(shlex.split(exe)): 

        print(line, end='') 

 

run_process("/usr/bin/python3 init.py") 

 

In the end the code allowed us to open the Excel code, recalculate all the formulas and save the file. 

 

Final Code 

 

First cell 

 

%sh 

sudo apt-get install libreoffice -y 

sudo apt-get --yes --force-yes install  python3-uno 

 

Second cell 

 

temp_file = "temp.xlsx" 

py_file_string = """ 

temp_file = '""" + temp_file + """' 

inputFile = temp_file 

 

import time 

import os 

import uno 

from com.sun.star.beans import PropertyValue 

 

import subprocess 

 

os.popen('/usr/lib/libreoffice/program/soffice --headless --accept="socket,host=localhost,port=2002;urp;StarOffice.ServiceManager"') 

time.sleep(3)  # Sleep for 3 seconds 

 

# get the uno component context from the PyUNO runtime 

localContext = uno.getComponentContext() 

 

# create the UnoUrlResolver 

resolver = localContext.ServiceManager.createInstanceWithContext( 

    "com.sun.star.bridge.UnoUrlResolver", localContext) 

 

# connect to the running office 

context = resolver.resolve("uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext") 

smgr = context.ServiceManager 

desktop = smgr.createInstanceWithContext( "com.sun.star.frame.Desktop",context) 

 

def urlify(path): 

    return uno.systemPathToFileUrl(os.path.realpath(path)) 

 

thedocurl = urlify(temp_file) 

 

# the new filename URL 

newdocurl = urlify(temp_file[:-5]+"_new.xlsx") 

 

print(thedocurl) 

 

document = desktop.loadComponentFromURL(thedocurl,"_blank",0,()) 

document.calculateAll() 

 

p = PropertyValue() 

p.Name = 'FilterName' 

p.Value = 'Calc Office Open XML' 

document.storeAsURL(newdocurl, (p,)) 

 

document.dispose() 

 

subprocess.run(['/usr/bin/killall', 'soffice.bin'], stdout=subprocess.PIPE).stdout.decode('utf-8') 

""" 

 

with open("init.py", "w") as f: 

   f.write(py_file_string) 

     

# function to run and print output from shell 

def run_process_func(exe): 

    p = subprocess.Popen(exe, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) 

    while True: 

        # returns None while subprocess is running 

        retcode = p.poll() 

        line = p.stdout.readline().decode("utf8") 

        yield line 

        if retcode is not None: 

            break 

 

def run_process(exe): 

    print(f"running cmd: {exe}") 

    for line in run_process_func(shlex.split(exe)): 

        print(line, end='') 

 

run_process("/usr/bin/python3 init.py") 

 

Sergey Nikonov
By Sergey Nikonov Dec 29, 2020 4:58:00 PM
Supply Chain Brief