Modelling JSON data in Scala

Data models are an important design decision in any application, and serialising your data is a key capability most will need in order to communicate with other services. JSON is among the most popular data formats used in APIs regardless of language, so naturally having a good grasp of how to (de)serialise your models to and from JSON is important. Today I'll focus on how to model your JSON data with case classes, and how to handle JSON serialisation. Note that the same principles apply regardless of format, though, and I have used all the same techniques to serialise to/from XML and others as needed.

Scala JSON libraries
Modelling
tl;dr

Scala JSON libraries

Unfortunately, JSON libraries in Scala are a dime a dozen, and any 3 Scala developers will have 4 favourite libraries to use. Fortunately, the guiding principles of almost all of these libraries are very similar:

We convert between a String and a JSON AST (Abstract Syntax Tree) provided by the library, and between that AST and our case class models
We use type classes to add JSON serialisation functionality for the types present in our models, which we make implicitly available
The library provides implicit type classes for basic types like Int, Double, String, List[T], Map[String, T] out of the box
The library provides utilities to make it trivial to derive type classes for case classes
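
As a rough illustration of the points above, here's a minimal sketch of the whole round trip using Spray JSON. The Point model and PointJsonProtocol are hypothetical names invented for this example, and other libraries will spell the same steps differently.

```scala
import spray.json._

// Hypothetical model, used only for illustration
final case class Point(x: Int, y: Int)

// Type class instances live in a protocol object; DefaultJsonProtocol
// supplies the formats for Int, String, List[T] and friends
object PointJsonProtocol extends DefaultJsonProtocol {
  // Derive the case class format from its two fields
  implicit val pointFmt: RootJsonFormat[Point] = jsonFormat2(Point.apply)
}

object PointExample {
  import PointJsonProtocol._

  // String -> AST
  val ast: JsValue = """{"x": 1, "y": 2}""".parseJson
  // AST -> model
  val point: Point = ast.convertTo[Point]
  // model -> AST -> String
  val roundTripped: String = point.toJson.compactPrint
}
```

The AST sits in the middle of both directions, which is what lets the library keep parsing and modelling concerns apart.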

There are several libraries available which all follow this basic design, including Spray JSON, Play JSON, and Circe. I'm going to be focusing on Spray JSON, but the principles I cover will apply regardless of your choice.

Modelling

For the purposes of this post, I'll be dealing with modelling some arbitrary JSON which has been provided by an API we don't control. This means we're focusing on building models which directly represent that data, in some form that's relevant to our imaginary application. We'll leave the issue of transforming or combining this data with other internal models for another day and just stick to representing our JSON payload.

This means that we can't change the JSON – and we'll consider cases where the JSON structure is less than ideal, and how to handle that gracefully without creating internal models which are equally ugly.

Basic modelling decisions

Let's first consider a fairly simple payload about library books:

{
  "isbn": "9780155658110",
  "title": "Nineteen Eighty-Four",
  "author": "George Orwell",
  "lastCheckedOut": "2020-05-02T09:23:11Z",
  "pageCount": 450,
  "reviewRating": 4.6
}

And a sensible model representing this data:

case class Book(
  isbn: String,
  title: String,
  author: String,
  lastCheckedOut: Instant,
  pageCount: Int,
  reviewRating: Double
)

Nothing too complex here: we've largely just picked Scala primitives matching the JSON primitives, with the exception of lastCheckedOut, where we want to parse an Instant. Since dates can come in many formats, we'll need to provide a format for Instant before the Book format to be able to parse this:

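import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

import spray.json._

trait BookJsonProtocol extends DefaultJsonProtocol {
  implicit val instantFmt: JsonFormat[Instant] = new JsonFormat[Instant] {
    private val formatter = DateTimeFormatter.ISO_OFFSET_DATE_TIME

    // An Instant carries no offset of its own, so pin it to UTC before formatting;
    // formatting a bare Instant with this formatter throws UnsupportedTemporalTypeException
    override def write(instant: Instant): JsValue = JsString(formatter.format(instant.atOffset(ZoneOffset.UTC)))
    override def read(v: JsValue): Instant = Instant.from(formatter.parse(v.convertTo[String]))
  }

  implicit val bookFmt: RootJsonFormat[Book] = jsonFormat6(Book)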
}

object BookJsonProtocol extends BookJsonProtocol

Since our JSON fields all have fairly sensible names and our types are largely simple, our case class format is simply jsonFormat6, having first provided a JsonFormat[Instant]. That format is slightly more hands-on, since it has to specify the ISO 8601 datetime format, but it's still quite straightforward.

Note: We've followed Spray JSON's guidelines of defining a JsonProtocol for our formats and packaging it separately, keeping our type classes separate from the models themselves. In a slightly more complex scenario, we'd go a step further and move our Instant format into a TimeJsonProtocol which we extend, and perhaps make it more generic for different time formats.
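
To make that last suggestion a little more concrete, here's one possible shape for such a TimeJsonProtocol. This is only a sketch: the instantFormat helper is a hypothetical name of my own, and it assumes the formatter you pass in works with offset date-times.

```scala
import java.time.{DateTimeException, Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

import spray.json._

trait TimeJsonProtocol {
  // Hypothetical helper: builds an Instant format from any formatter
  // that can print and parse an offset date-time
  def instantFormat(formatter: DateTimeFormatter): JsonFormat[Instant] = new JsonFormat[Instant] {
    // Instants are always written out in UTC
    override def write(instant: Instant): JsValue =
      JsString(formatter.format(instant.atOffset(ZoneOffset.UTC)))

    override def read(v: JsValue): Instant = v match {
      case JsString(s) =>
        try Instant.from(formatter.parse(s))
        catch {
          // Report bad timestamps the same way as any other deserialisation failure
          case e: DateTimeException => deserializationError(s"Invalid timestamp '$s'", e)
        }
      case other => deserializationError(s"Expected a timestamp string but got $other")
    }
  }

  // Default to ISO 8601 with an offset, as in the Book example
  implicit val instantFmt: JsonFormat[Instant] = instantFormat(DateTimeFormatter.ISO_OFFSET_DATE_TIME)
}

object TimeJsonProtocol extends TimeJsonProtocol
```

A protocol such as BookJsonProtocol could then mix this in (extends DefaultJsonProtocol with TimeJsonProtocol) rather than defining its own Instant format.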

Some libraries suggest instead including your formats on the models' companion objects, but this places your serialisation logic into your models, which type classes typically exist to avoid. Note that this also would not be possible for types beyond your control, such as Instant, and you'd have to place them elsewhere regardless.

Stronger types

Depending on what's being modelled, you may also choose to use stronger types for some fields. For example, an ID field representing a particular type of ID might be best represented by its own type, especially if there are specific formatting concerns. That could be as simple as a wrapper class like case class Isbn(value: String), which primarily acts to distinguish the type in your code when ISBNs are used as keys frequently. But it may be more complex depending on encoding: an ISBN is actually made up of several subgroups, including a checksum digit, and you might want to capture that information.
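
Here's a small sketch of what the simple wrapper might look like, assuming we only care about ISBN-13 and only want to verify the checksum digit. The isValid13 helper and IsbnJsonProtocol are names I've made up for illustration.

```scala
import spray.json._

// Wrapper type: distinguishes ISBNs from other strings in our code
final case class Isbn(value: String)

object Isbn {
  // ISBN-13 checksum: weight the digits 1, 3, 1, 3, ... and the total must be a multiple of 10
  def isValid13(s: String): Boolean =
    s.length == 13 && s.forall(c => c >= '0' && c <= '9') &&
      s.zipWithIndex.map { case (c, i) => (c - '0') * (if (i % 2 == 0) 1 else 3) }.sum % 10 == 0
}

object IsbnJsonProtocol {
  // Serialised as a plain JSON string, so the wrapper never shows up in the payload
  implicit val isbnFmt: JsonFormat[Isbn] = new JsonFormat[Isbn] {
    override def write(isbn: Isbn): JsValue = JsString(isbn.value)

    override def read(v: JsValue): Isbn = v match {
      case JsString(s) if Isbn.isValid13(s) => Isbn(s)
      case other => deserializationError(s"Invalid ISBN-13: $other")
    }
  }
}
```

The JSON still carries a bare string; only the Scala side gains the stronger type and the validation.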

In other cases, you may have a set of valid possible values appearing in a field, such as in a "status" field or a "type" field, and wish to model that as an enumeration. You might even expect one of multiple different types of objects to appear and require an algebraic data type; that's particularly likely in lists of varied elements.

Let's take a look at an example which includes all of these features and represents a virtual shelf of items a user has put on their "to read/watch" list:

{
  "shelfId": "10821-00001",
  "items": [
    {
      "type": "book",
      "title": "Feast of Souls",
      "author": "C. S. Friedman",
      "status": "todo"
    },
    {
      "type": "movie",
      "title": "Lucky Number Slevin",
      "director": "Paul McGuigan",
      "status": "done"
    }
  ]
}

and the models:

case class ShelfId(userId: String, shelfId: String)

sealed trait MediaItem {
  val title: String
  val status: MediaItem.Status
}
object MediaItem {
  sealed trait Status {
    val value: String
  }
  object Status {
    case object Todo extends Status {
      override val value: String = "todo"
    }
    case object Done extends Status {
      override val value: String = "done"
    }
  }

  case class Book(title: String, author: String, status: Status) extends MediaItem
  case class Movie(title: String, director: String, status: Status) extends MediaItem
}

case class Shelf(shelfId: ShelfId, items: List[MediaItem])

When working with cases like these, the guiding principle is to keep the model as the fully-parsed, clean representation of the data, and to deal with transforming to and from the simpler JSON representation in the serialisation code under your XXXJsonProtocol trait.

Note that we've used an algebraic data type, using sealed traits and case objects, for our enumeration value, the same way we did for our different types of media item. Since this is simply a set of enumerable values we could also have used Scala's Enumeration type or another strategy; there are a few possible ways of dealing with enumerations in Scala.
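
For comparison, here's a rough sketch of the same two status values using scala.Enumeration instead. ItemStatus and fromString are hypothetical names chosen so as not to clash with the Status type above.

```scala
// Sketch of the status values as a scala.Enumeration
object ItemStatus extends Enumeration {
  // Each value's name doubles as its serialised string
  val Todo: Value = Value("todo")
  val Done: Value = Value("done")

  // withName throws NoSuchElementException for unknown names,
  // so offer a safe lookup for use when parsing
  def fromString(s: String): Option[Value] = values.find(_.toString == s)
}
```

This is shorter, but every value shares the single type ItemStatus.Value and the compiler won't check pattern matches for exhaustiveness, which is one reason I've stuck with sealed traits here.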

Note also that we drop the type field from items since that information is now provided by the type itself.

Let's take a look at what the serialisation logic might look like for the above example:

import spray.json._

import com.example.blog.models._
import com.example.blog.models.MediaItem._
import com.example.blog.models.MediaItem.Status._

trait ShelfJsonProtocol extends DefaultJsonProtocol {
  implicit val statusFmt: JsonFormat[Status] = new JsonFormat[Status] {
    override def write(status: Status): JsValue = JsString(status.value)

    override def read(v: JsValue): Status = v.convertTo[String] match {
      case Todo.value => Todo
      case Done.value => Done
      case s => deserializationError(s"Unknown status value '$s'")
    }
  }

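  implicit val bookFmt: RootJsonFormat[Book] = jsonFormat3(Book)
  implicit val movieFmt: RootJsonFormat[Movie] = jsonFormat3(Movie)

  implicit val mediaItemFmt: JsonFormat[MediaItem] = new JsonFormat[MediaItem] {
    // The case class formats know nothing about "type", so add it back when
    // writing; otherwise our own output couldn't be read back in
    private def tagged(tag: String, json: JsValue): JsValue =
      JsObject(json.asJsObject.fields + ("type" -> JsString(tag)))

    override def write(item: MediaItem): JsValue = item match {
      case b: Book => tagged("book", b.toJson)
      case m: Movie => tagged("movie", m.toJson)
    }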

    override def read(v: JsValue): MediaItem = {
      // Note that we allow .fields("type") to throw a NoSuchElementException if absent here since
      // spray JSON reports errors as exceptions, but you might handle this more specifically to
      // make your error messaging clearer
      v.asJsObject.fields("type").convertTo[String] match {
        case "book" => v.convertTo[Book]
        case "movie" => v.convertTo[Movie]
        case t => deserializationError(s"Unknown item type '$t'")
      }
    }
  }

  implicit val shelfIdFmt: JsonFormat[ShelfId] = new JsonFormat[ShelfId] {
    override def write(shelfId: ShelfId): JsValue = {
      JsString(s"${shelfId.userId}-${shelfId.shelfId}")
    }

    override def read(v: JsValue): ShelfId = v.convertTo[String].split("-").toList match {
      case userId :: shelfId :: Nil => ShelfId(userId, shelfId)
      case _ => deserializationError(s"Invalid shelfId format '$v'")
    }
  }

  implicit val shelfFmt: RootJsonFormat[Shelf] = jsonFormat2(Shelf)

}

object ShelfJsonProtocol extends ShelfJsonProtocol

A few things to unpack here:

The status format simply maps the expected set of strings into our status types; since we annotated the string value onto the types, that's even simpler in the write direction
For media items, we define individual formats for Book and Movie, which are simple enough in themselves, and then write a format for the overall MediaItem which will use the type field to figure out which format it should be trying to use, and then delegate to it. Since we have a nice clear type field this is quite easy in this case.
For our shelf ID, we've got two numbers separated by a dash, which we've decided are our user ID and shelf ID respectively, so we've written a short format which parses that information out of the JSON string where needed in order to build our case class.
Finally, with all the complex types now dealt with, the Shelf itself is just a straightforward jsonFormat2.
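
The comment in the media item format mentions that a missing type field could be handled more specifically. One way to do that, sketched here with a hypothetical MediaItemTag.typeOf helper, is to look the field up safely and raise a deserialisation error with a clearer message:

```scala
import spray.json._

object MediaItemTag {
  // Hypothetical helper: reads the "type" discriminator, failing with a
  // descriptive message rather than a bare NoSuchElementException
  def typeOf(v: JsValue): String = v match {
    case JsObject(fields) =>
      fields.get("type") match {
        case Some(JsString(t)) => t
        case Some(other) => deserializationError(s"Expected 'type' to be a string but got $other")
        case None => deserializationError("Missing 'type' field on media item")
      }
    case other => deserializationError(s"Expected a JSON object but got $other")
  }
}
```

The read method could then match on MediaItemTag.typeOf(v) instead of reaching into fields("type") directly.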

Cleaning up difficult JSON

I've covered a couple of examples where the JSON data is quite simple to work with, but that's not always the case, and often the readability of your data will be beyond your control. I've run into several examples in the past of quite arcane JSON (or worse, 20-year-old XML) payloads which are difficult to tame.

The key thing to remember is: your internal models don't have to look anything like your JSON.

Here's an example compounding several mistakes I've had to deal with in the past:

{
  "MSGDat11": "Sun, Aug 7 2022",
  "MsgTim12": "17:11 PM",
  "BkDtails13": [
    {
      "Tit118": "Nineteen Eighty-Four"
    },
    {
      "AutNam121": "George Orwell"
    },
    {
      "ISBN117": "9780155658110"
    }
  ],
  "IsFam812": "Yes",
  "ForSal813": "No",
  "NumPag921": "450"
}

This looks like an extreme example but contains several issues I've had to work around in the past:

The field names are unreadable and inconsistent
Several types are poor choices, e.g. the number of pages as a string, and especially Yes / No instead of a boolean
Our "book details" should be a JSON object but are presented as a list of JSON objects, in indeterminate order, providing one field each.
I've been brief with my example but we may have another 50 top-level fields in this object, deserving of further structure, and which would cause serious pain with case classes

I will mention briefly here that Spray JSON gives you some tools to help with the simpler concern of field names, e.g. jsonFormat(MyModel.apply, "MSGDat11", "MsgTim12", ...), which takes the constructor followed by the field names. Reshaping the JSON data into a sensible structure to parse into your model can be hard work for some of these issues, though, and I'll leave the details of that as an exercise for the reader, to be revisited in future.
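
As a starting point for that exercise, here's a rough sketch of a hand-written reader for part of the payload above. BookRecord and the helper names are hypothetical, I've assumed ForSal813 means "for sale", and the date, time and IsFam812 fields are left out to keep it short.

```scala
import scala.util.Try
import spray.json._

// Hypothetical clean model for part of the payload above
final case class BookRecord(title: String, author: String, isbn: String, forSale: Boolean, pageCount: Int)

object BookRecordJsonProtocol {
  private def string(fields: Map[String, JsValue], key: String): String =
    fields.get(key) match {
      case Some(JsString(s)) => s
      case _ => deserializationError(s"Missing or non-string field '$key'")
    }

  private def yesNo(fields: Map[String, JsValue], key: String): Boolean =
    string(fields, key) match {
      case "Yes" => true
      case "No" => false
      case other => deserializationError(s"Expected Yes/No in '$key' but got '$other'")
    }

  implicit val bookRecordReader: RootJsonReader[BookRecord] = new RootJsonReader[BookRecord] {
    override def read(json: JsValue): BookRecord = {
      val top = json.asJsObject.fields
      // Merge the list of single-field objects into one map, whatever their order
      val details: Map[String, JsValue] = top.get("BkDtails13") match {
        case Some(JsArray(elements)) => elements.flatMap(_.asJsObject.fields).toMap
        case _ => deserializationError("Missing or non-array field 'BkDtails13'")
      }
      val pages = string(top, "NumPag921")
      BookRecord(
        title = string(details, "Tit118"),
        author = string(details, "AutNam121"),
        isbn = string(details, "ISBN117"),
        forSale = yesNo(top, "ForSal813"),
        pageCount = Try(pages.toInt).getOrElse(deserializationError(s"Invalid page count '$pages'"))
      )
    }
  }
}
```

All of the ugliness stays inside the reader, and the rest of the application only ever sees BookRecord.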

Regardless, a mistake I've seen a few times is fearing having to write logic into the serialisation layer, such as we've seen in some examples above, and preferring to write models directly equivalent to the JSON. Largely the reasoning here is not wanting to get to grips with how to write custom serialisers, and the argument I've heard for this approach is that minimal serialisation logic makes it easy to see what's being modelled. There are many problems with this approach, though, and with an example as ugly as the one above, several become obvious:

It leaks serialisation details into your code: you have to deal with unreadable, confusing, maybe misspelt field names, along with incorrect types and strange structures
In Scala 2, function types only go up to Function22, and helpers such as Spray JSON's jsonFormatN stop at 22 fields too (Scala 3 lifts the limit on function types, but that doesn't extend the library's helpers). That means that with more than 22 fields in a single case class, your library's utilities start breaking down. It'll also thwart your own efforts to refactor to stay DRY – in Scala 2 you can't pass around your own 25-argument function values to help map your field names, either. Worse still, this issue may only become evident after hours of committing to the approach or when you need to add new fields to your model months later.
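
One way around the field limit, sketched below with hypothetical Contact, Address and Customer models, is to split a wide, flat payload into smaller groups and let each group's format read its own fields from the same JSON object. This relies on Spray JSON's case class formats ignoring fields they don't know about.

```scala
import spray.json._

// Hypothetical models: a wide, flat payload split into smaller groups
final case class Contact(name: String, email: String)
final case class Address(street: String, city: String)
final case class Customer(contact: Contact, address: Address)

object CustomerJsonProtocol extends DefaultJsonProtocol {
  implicit val contactFmt: RootJsonFormat[Contact] = jsonFormat2(Contact.apply)
  implicit val addressFmt: RootJsonFormat[Address] = jsonFormat2(Address.apply)

  implicit val customerFmt: RootJsonFormat[Customer] = new RootJsonFormat[Customer] {
    // Each sub-format picks its own fields out of the same flat object
    override def read(json: JsValue): Customer =
      Customer(json.convertTo[Contact], json.convertTo[Address])

    // Writing merges the fields of each group back into one flat object
    override def write(c: Customer): JsValue =
      JsObject(c.contact.toJson.asJsObject.fields ++ c.address.toJson.asJsObject.fields)
  }
}
```

No single case class needs more than a handful of fields, and the model ends up with the extra structure the flat JSON was missing.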

These mistakes are usually made because the fear of writing serialisers outweighs the fear of dealing with bad JSON data, but before deciding you want to avoid any serialisation logic at all, keep in mind exactly how bad data can get.

tl;dr

In brief:

Always write clean, sensibly-typed models for your data, even if the serialised data has to look different.
Decide what your ideal model should look like, and then figure out how to write serialisation logic to get there.
Keep serialisation concerns in your serialisation layer, and don't be afraid to write custom formats if needed.
Consult the docs and make use of helpers to make writing your own formats easier.