Using Python in an Azure Function to crop a PDF file

I was looking for a way to crop a PDF to its visible bounds, that is, to remove the unnecessary empty space surrounding the contents of the page.

I knew there was a Linux tool called pdfcrop that worked well for my requirement (it uses Ghostscript behind the scenes). The thing is, how could I call pdfcrop if my service (that needs the crop) is a .NET WebAPI service?

.NET has classes such as ProcessStartInfo and Process that let us start a process on the host. But using them in a WebAPI service felt hacky, and I was not sure whether it could leave zombie processes behind in the OS.

I wanted to have something separate from the service, so I decided to try an Azure Function with Python. A Function has the benefit of being consumption-based and event-driven, and it can run short bursts of stateless custom code at scale.

My proof of concept

The first step was to initialize the Function App:

func init PdfCropFunctionApp --docker --worker-runtime python

Then the Function itself:

cd .\PdfCropFunctionApp\
func new --name PdfCropFunction --template "HTTP trigger" --language python

The Dockerfile (generated by the --docker flag) is necessary because we need to install pdfcrop in the image. I added this line to it:

RUN apt-get update && apt-get install -y texlive-extra-utils

Finally, the script, __init__.py:

import base64
import json
import logging
import subprocess
import uuid

import azure.functions as func


def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    # read the base64-encoded PDF from the request body
    try:
        pdf_base64 = req.get_json().get('pdf_base64')
    except ValueError:
        pdf_base64 = None
    if not pdf_base64:
        return func.HttpResponse("Missing 'pdf_base64' in the JSON body.", status_code=400)

    # convert to bytes (not named `bytes`, which would shadow the built-in type)
    pdf_bytes = base64.b64decode(pdf_base64)

    # generate a unique id so concurrent requests do not overwrite each other's files
    unique_id = str(uuid.uuid4())[:8]
    input_path = f"/tmp/pdf_{unique_id}.pdf"
    output_path = f"/tmp/pdf_{unique_id}-crop.pdf"

    logging.info(f'Length in bytes: {len(pdf_bytes)}. Unique Id: {unique_id}')

    # save the file
    with open(input_path, "wb") as binary_file:
        binary_file.write(pdf_bytes)

    # call pdfcrop; an argument list avoids going through a shell
    result = subprocess.run(
        ["/usr/bin/pdfcrop", "--margins", "5 5 5 5", input_path, output_path]
    )
    if result.returncode != 0:
        return func.HttpResponse("pdfcrop failed.", status_code=500)

    # get the resulting file as base64
    with open(output_path, "rb") as cropped_file:
        pdf_base64_cropped = base64.b64encode(cropped_file.read())

    # return the cropped PDF as base64
    cropped_json = {"pdf_base64_cropped": pdf_base64_cropped.decode("utf-8")}

    return func.HttpResponse(
        status_code=200,
        mimetype="application/json",
        body=json.dumps(cropped_json),
    )
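
One thing the proof of concept does not do is delete the files it writes to /tmp. A possible way to handle that, sketched below, is to do the pdfcrop step inside a temporary directory that is removed automatically. The helper name crop_pdf and its run parameter are my own assumptions for illustration (the parameter only exists so the helper can be exercised without pdfcrop installed); the pdfcrop call itself mirrors the one in the script, with the output file passed explicitly.

```python
import subprocess
import tempfile
from pathlib import Path


def crop_pdf(pdf_bytes: bytes, run=subprocess.run) -> bytes:
    """Crop a PDF with pdfcrop inside a throwaway directory (hypothetical helper)."""
    # TemporaryDirectory removes the directory and its contents on exit,
    # even if pdfcrop fails and an exception propagates.
    with tempfile.TemporaryDirectory() as work_dir:
        input_path = Path(work_dir) / "input.pdf"
        output_path = Path(work_dir) / "input-crop.pdf"
        input_path.write_bytes(pdf_bytes)

        # check=True raises CalledProcessError on a non-zero exit code.
        run(
            ["/usr/bin/pdfcrop", "--margins", "5 5 5 5",
             str(input_path), str(output_path)],
            check=True,
        )
        return output_path.read_bytes()
```

With something like this, main would only need to decode the request, call crop_pdf, and encode the result, and no unique id would be required because every call gets its own directory.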

It can be built and run (locally) as usual:

docker build -t pdfcropimage .

docker run -p 8080:80 -it --rm pdfcropimage

The request should be sent like this:

POST http://localhost:8080/api/PdfCropFunction

{
    "pdf_base64": "JVBERi0xLjcKCjQgMC..."
}

And the response will come in a similar format:

{
    "pdf_base64_cropped": "JVBERi0xLjcKJdDUxdgKN..."
}
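
For completeness, here is a rough sketch of a small Python client for this request and response format, using only the standard library. The function names are assumptions of mine, and the default URL is simply the local one from the docker run command above; a deployed Function would have a different URL and probably a key.

```python
import base64
import json
import urllib.request


def build_request_body(pdf_bytes: bytes) -> bytes:
    """Wrap raw PDF bytes in the JSON body the Function expects."""
    payload = {"pdf_base64": base64.b64encode(pdf_bytes).decode("ascii")}
    return json.dumps(payload).encode("utf-8")


def parse_response_body(body: bytes) -> bytes:
    """Extract the cropped PDF bytes from the Function's JSON response."""
    return base64.b64decode(json.loads(body)["pdf_base64_cropped"])


def crop_via_function(pdf_bytes: bytes,
                      url: str = "http://localhost:8080/api/PdfCropFunction") -> bytes:
    """POST a PDF to the Function and return the cropped PDF bytes."""
    request = urllib.request.Request(
        url,
        data=build_request_body(pdf_bytes),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return parse_response_body(response.read())
```

Calling crop_via_function with the bytes of a local PDF and writing the returned bytes to a new file should give the cropped document, provided the container is running.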

This is what it looks like 😉

Original PDF | Cropped PDF