Decoding HTML character references in Swift

When all you need is plain text

Originally posted on January 22, 2016

Introduction

When playing around with the BoardGameGeek XML API, one of the (many) things that bothered me was that the text it returns contains HTML character references, like &mdash; or &#10;. I didn't feel like including WebViews in my application just to render some text, so I started looking for ways to decode these references. All I could find were some outdated and incomplete Objective-C solutions, so I decided to write my own.

HTML character references

HTML character references come in three forms:

  • &name;, where name is a character name (also called an entity),
  • &#d;, where d is a decimal code point,
  • &#xh;, where h is a hexadecimal code point.

A complete list of character names is included in the HTML 5 specification. This list of character names and their corresponding values is also available in JSON.
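To make the three forms concrete, here is a minimal sketch in current Swift that classifies a single, complete reference. The `CharacterReference` type and the `classify` function are names made up for this illustration, and the function does not check whether a name actually appears in the HTML list.

```swift
/// The three forms a character reference can take (illustrative type).
enum CharacterReference: Equatable {
    case named(String)        // &name;
    case decimal(UInt32)      // &#d;
    case hexadecimal(UInt32)  // &#xh;
}

/// Classifies a complete reference such as "&amp;" or "&#x2014;".
/// Returns nil when the text does not have the shape of a reference.
func classify(_ text: String) -> CharacterReference? {
    guard text.hasPrefix("&"), text.hasSuffix(";"), text.count > 2 else { return nil }
    let body = text.dropFirst().dropLast()  // strip the "&" and the ";"
    if body.hasPrefix("#x") || body.hasPrefix("#X") {
        return UInt32(body.dropFirst(2), radix: 16).map(CharacterReference.hexadecimal)
    } else if body.hasPrefix("#") {
        return UInt32(body.dropFirst(), radix: 10).map(CharacterReference.decimal)
    }
    return .named(String(body))
}

let named = classify("&mdash;")      // .named("mdash")
let decimal = classify("&#8212;")    // .decimal(8212)
let hex = classify("&#x2014;")       // .hexadecimal(0x2014), the same code point as 8212
```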

Cleaning up

The first thing I did was clean up that JSON file. It includes mappings for character references that are missing the trailing semicolon (I really didn't feel like parsing those) and contains more information than I needed. All I really needed was a [String: String] dictionary mapping entities to their corresponding Unicode characters. I wrote the following Bash script to take care of that:

#!/bin/bash

# Restore the opening brace that will be removed by the filter command.
echo '{' > output.json

# Filter out entities that do not contain a trailing semicolon.
# Entity names may contain digits (e.g. &frac12;), so allow those too.
sed -En 's/&[a-zA-Z0-9]+;/&/p' < entities.json |

# Simplify the values from a full object to a single string.
sed -E 's/{.*("[\\u0-9A-F]+") }/\1/' >> output.json

# Restore the closing brace that was removed by the filter command.
echo '}' >> output.json

Note that this script is made for OS X (or FreeBSD). On Linux, replace the E flag with r. If you don't feel like running the script yourself, you can find the resulting output (renamed to entities.json) here.
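For readers who would rather stay in Swift, the same clean-up can be sketched with JSONDecoder. This is only an illustration, not the script used above: it assumes each value in the original file is an object with a "characters" string next to the code points, and the `Entry` type and `simplifiedEntities(from:)` function are hypothetical names.

```swift
import Foundation

/// One value in the original file; only the "characters" field is needed here.
/// (Assumed shape: { "codepoints": [...], "characters": "..." }.)
struct Entry: Decodable {
    let characters: String
}

/// Keeps only the references that end in a semicolon and maps each one
/// directly to its replacement text.
func simplifiedEntities(from data: Data) throws -> [String: String] {
    let full = try JSONDecoder().decode([String: Entry].self, from: data)
    return full
        .filter { $0.key.hasSuffix(";") }
        .mapValues { $0.characters }
}
```

The result has the same [String: String] shape as the cleaned-up file, so it could be written back out with JSONSerialization or used directly.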

Find and replace

Decoding the character references was pretty easy once I had a complete list of entities. The following extension on String contains a method that does a single pass over the text and decodes every character reference it encounters. The code should be self-explanatory. Note that, other than reading in the entities.json file, this code is pure Swift and does not rely on Foundation.

import Foundation

extension String {

    /// Decodes every HTML character reference in this string, in a single pass.
    mutating func decodeHtmlCharacterReferences() {
        var decodedString = ""
        var reference = ""
        var inReference = false
        for character in self {
            if inReference && character != "&" {
                reference.append(character)
                if character == ";" {
                    inReference = false
                    // Keep unknown or invalid references as they were written.
                    decodedString.append(decodeReference(reference) ?? reference)
                }
            } else if character == "&" {
                if inReference {
                    // The pending "&" was not the start of a reference after all.
                    decodedString.append(reference)
                }
                reference = "&"
                inReference = true
            } else {
                decodedString.append(character)
            }
        }
        if inReference {
            // The text ended in the middle of a reference; keep what was read.
            decodedString.append(reference)
        }
        self = decodedString
    }
}

/// Returns the decoded text for a complete reference such as "&amp;" or "&#x2014;",
/// or nil if the reference is unknown or names an invalid code point.
private func decodeReference(_ reference: String) -> String? {
    if let entity = entities[reference] {
        return entity
    }
    let codePoint: UInt32?
    if reference.hasPrefix("&#x") || reference.hasPrefix("&#X") {
        codePoint = UInt32(reference.dropFirst(3).dropLast(), radix: 16)
    } else if reference.hasPrefix("&#") {
        codePoint = UInt32(reference.dropFirst(2).dropLast())
    } else {
        codePoint = nil
    }
    // UnicodeScalar(_:) is nil for surrogates and values above U+10FFFF.
    return codePoint.flatMap { UnicodeScalar($0) }.map { String(Character($0)) }
}

/// The entity table, loaded once from entities.json in the main bundle.
private let entities: [String: String] = {
    let fileURL = Bundle.main.url(forResource: "entities", withExtension: "json")!
    let fileData = try! Data(contentsOf: fileURL)
    return try! JSONSerialization.jsonObject(with: fileData) as! [String: String]
}()
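One edge case with numeric references deserves a closer look: not every number is a valid Unicode scalar. &#xD800; falls in the surrogate range, for example, and anything above U+10FFFF is outside Unicode altogether. The sketch below, in current Swift, shows one way to validate a code point before turning it into a Character. The function name `character(forCodePoint:)` is made up for this example.

```swift
/// Converts a numeric code point to a Character,
/// or returns nil if the value is not a valid Unicode scalar.
func character(forCodePoint codePoint: UInt32) -> Character? {
    // Unicode.Scalar's failable initializer rejects surrogates
    // (U+D800 through U+DFFF) and values above U+10FFFF.
    guard let scalar = Unicode.Scalar(codePoint) else { return nil }
    return Character(scalar)
}

let valid = character(forCodePoint: 0xE9)        // "é"
let surrogate = character(forCodePoint: 0xD800)  // nil
let tooLarge = character(forCodePoint: 0x110000) // nil
```

What to do with a nil result is a design choice: dropping the reference, keeping the original text, or substituting U+FFFD are all defensible.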

That's it! Feel free to reuse this code and modify it to fit your needs.
