Overview

overview.png

There are only three steps, and I tested them all using a playground app.

 

Sample Code

Download

 

Prepare the receipt images.

I found a collection of receipt images on GitHub; it contains almost 200 receipts.

You can download the receipt images below.

 

Let’s start coding.

Only a few lines of code are needed. I used a playground app, so let’s follow the steps.

Create a playground file in Xcode (File -> New -> Playground). Choose the iOS or macOS platform.

Drag the receipt images into the Resources folder. Then click the Sources folder and add a new Helper.swift file.

 

Helper.swift

I added functions to load the receipt image files and to draw the bounding boxes of the detected text areas.

Helper.swift file for macOS

import Cocoa
import Vision

public func getTestReceiptImageName(_ number: Int) -> String {
    String(format: "%d-receipt", number)
}

//https://www.swiftbysundell.com/tips/making-uiimage-macos-compatible/
public extension NSImage {
    var cgImage: CGImage? {
        var proposedRect = CGRect(origin: .zero, size: size)
        return cgImage(forProposedRect: &proposedRect,
                       context: nil,
                       hints: nil)
    }
}

//https://www.udemy.com/course/machine-learning-with-core-ml-2-and-swift
public func visualization(
    _ image: NSImage,
    observations: [VNDetectedObjectObservation],
    boundingBoxColor: NSColor
) -> NSImage {
    var transform = CGAffineTransform.identity
    transform = transform.scaledBy(x: image.size.width, y: image.size.height)

    image.lockFocus()
    let context = NSGraphicsContext.current?.cgContext
    context?.saveGState()

    context?.setLineWidth(2)
    context?.setLineJoin(CGLineJoin.round)
    context?.setStrokeColor(.black)
    context?.setFillColor(boundingBoxColor.cgColor)

    observations.forEach { observation in
        let bounds = observation.boundingBox.applying(transform)
        context?.addRect(bounds)
    }

    context?.drawPath(using: CGPathDrawingMode.fillStroke)
    context?.restoreGState()
    image.unlockFocus()
    return image
}

Helper.swift file for iOS

import Foundation
import UIKit
import Vision

public func getTestReceiptImageName(_ number: Int) -> String {
    String(format: "%d-receipt.jpg", number)
}

//https://www.udemy.com/course/machine-learning-with-core-ml-2-and-swift
public func visualization(_ image: UIImage, observations: [VNDetectedObjectObservation]) -> UIImage {
    //Vision returns normalized coordinates with the origin at the bottom-left,
    //so flip the y-axis and then scale up to the image size.
    var transform = CGAffineTransform.identity
        .scaledBy(x: 1, y: -1)
        .translatedBy(x: 0, y: -image.size.height)
    transform = transform.scaledBy(x: image.size.width, y: image.size.height)

    UIGraphicsBeginImageContextWithOptions(image.size, true, 0.0)
    let context = UIGraphicsGetCurrentContext()

    image.draw(in: CGRect(origin: .zero, size: image.size))
    context?.saveGState()

    context?.setLineWidth(2)
    context?.setLineJoin(CGLineJoin.round)
    context?.setStrokeColor(UIColor.black.cgColor)
    context?.setFillColor(red: 0, green: 1, blue: 0, alpha: 0.3)

    observations.forEach { observation in
        let bounds = observation.boundingBox.applying(transform)
        context?.addRect(bounds)
    }

    context?.drawPath(using: CGPathDrawingMode.fillStroke)
    context?.restoreGState()
    let resultImage = UIGraphicsGetImageFromCurrentImageContext()
    UIGraphicsEndImageContext()
    return resultImage!
}

 

Vision Framework

The Vision framework enables me to detect text in a receipt image and to get the recognized strings after OCR processing.

Step 1. Load a receipt image from the Resources folder

//Step 1. Load a receipt image from the Resources folder
import Vision
import Cocoa

let image = NSImage(imageLiteralResourceName: getTestReceiptImageName(1000))

 

Step 2. Declare a VNRecognizeTextRequest to detect text in the receipt image

let recognizeTextRequest = VNRecognizeTextRequest { (request, error) in
    guard let observations = request.results as? [VNRecognizedTextObservation] else {
        print("Error: \(error! as NSError)")
        return
    }
    for currentObservation in observations {
        let topCandidate = currentObservation.topCandidates(1)
        if let recognizedText = topCandidate.first {
            //OCR Results
            print(recognizedText.string)
        }
    }
    let fillColor: NSColor = NSColor.green.withAlphaComponent(0.3)
    let result = visualization(image, observations: observations, boundingBoxColor: fillColor)
}
recognizeTextRequest.recognitionLevel = .accurate

 

Step 3. Process the receipt image using VNImageRequestHandler.

func request(_ image: NSImage) {
    guard let cgImage = image.cgImage else {
        return
    }
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    DispatchQueue.global(qos: .userInitiated).async {
        do {
            try handler.perform([recognizeTextRequest])
        }
        catch let error as NSError {
            print("Failed: \(error)")
        }
    }
}
request(image)

 

Step 4. Check the result

To open the result image, click the Quick Look (eye) button in the playground’s results sidebar.

Results.png
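
If Quick Look is inconvenient, you can also write the visualization result to a file and open it in Preview. This is a minimal sketch of my own, not part of the sample code; savePNG and the output path are hypothetical.

import Cocoa

//Hypothetical helper: write an NSImage to disk as a PNG so it can be inspected outside the playground.
public func savePNG(_ image: NSImage, to url: URL) throws {
    guard let tiffData = image.tiffRepresentation,
          let bitmap = NSBitmapImageRep(data: tiffData),
          let pngData = bitmap.representation(using: .png, properties: [:]) else {
        return
    }
    try pngData.write(to: url)
}

//Usage inside the VNRecognizeTextRequest completion handler, after `result` is created:
//try? savePNG(result, to: URL(fileURLWithPath: "/tmp/receipt-result.png"))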

 

Can Vision Framework detect other languages?

Yes, but it supports only a few. Take a look at the supported languages:

let supportedLanguages = try VNRecognizeTextRequest.supportedRecognitionLanguages(for: .accurate, revision: VNRecognizeTextRequestRevision2)
print("\(supportedLanguages.count) Languages are available. -> \(supportedLanguages)")

//8 Languages are available. -> ["en-US", "fr-FR", "it-IT", "de-DE", "es-ES", "pt-BR", "zh-Hans", "zh-Hant"]
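
If your receipts use one of these languages, you can tell the request which ones to prefer. A minimal sketch; the language codes below are just examples taken from the list above.

//Restrict recognition to specific supported languages (example codes).
recognizeTextRequest.recognitionLanguages = ["fr-FR", "en-US"]
//Optionally let Vision apply language-based correction to the recognized strings.
recognizeTextRequest.usesLanguageCorrection = true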

 

NaturalLanguage framework

You can detect the lexical class of a string using the NaturalLanguage framework. I modified the Step 2 code to print out the tag of each token.

import NaturalLanguage
let tagger = NLTagger(tagSchemes: [.nameTypeOrLexicalClass])
let recognizeTextRequest = VNRecognizeTextRequest { (request, error) in
    guard let observations = request.results as? [VNRecognizedTextObservation] else {
        print("Error: \(error! as NSError)")
        return
    }
    let ocrResults = observations.compactMap { $0.topCandidates(1).first?.string }.joined(separator: "\n")
    tagger.string = ocrResults
    tagger.enumerateTags(in: ocrResults.startIndex..<ocrResults.endIndex, unit: NLTokenUnit.word, scheme: NLTagScheme.nameTypeOrLexicalClass, options: [.omitPunctuation, .omitWhitespace]) { tag, range in
        print("Tag: \(tag?.rawValue ?? "unknown") -> \(ocrResults[range])")
        return true
    }

    let fillColor: NSColor = NSColor.green.withAlphaComponent(0.3)
    let result = visualization(image, observations: observations, boundingBoxColor: fillColor)
}
recognizeTextRequest.recognitionLevel = .accurate

It gives me more information, such as Noun, Number, PlaceName, PersonalName, and more.
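
For example, if you only care about numeric tokens (prices and totals on the receipt), you can filter on the .number tag. This is a minimal sketch of my own; extractNumbers is a hypothetical helper that assumes the joined OCR text from the closure above.

import NaturalLanguage

//Hypothetical helper: collect every token that the tagger classifies as a number.
public func extractNumbers(from ocrResults: String) -> [String] {
    let tagger = NLTagger(tagSchemes: [.nameTypeOrLexicalClass])
    tagger.string = ocrResults
    var numbers: [String] = []
    tagger.enumerateTags(in: ocrResults.startIndex..<ocrResults.endIndex,
                         unit: .word,
                         scheme: .nameTypeOrLexicalClass,
                         options: [.omitPunctuation, .omitWhitespace]) { tag, range in
        if tag == .number {
            numbers.append(String(ocrResults[range]))
        }
        return true
    }
    return numbers
}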

 

Conclusion

I’m very impressed with the Vision framework. I can implement OCR features in just a few lines of code, and it works entirely on-device, so no internet connection is needed.

I hope Apple supports more languages like Korean, Japanese, Thai, and others.

Quote of the week

"People ask me what I do in the winter when there's no baseball. I'll tell you what I do. I stare out the window and wait for spring."

~ Rogers Hornsby