Using Scala to Read Really, Really Large Files – Part 4: Akka Streams

Akka Streams is an implementation of Reactive Streams, which intends to bring sanity to handling asynchronous streams of large quantities of data.

As a bonus, they provide both a Scala and a Java DSL, which can be handy as moving between Java and Scala is really only easy when calling Java methods from Scala code.

Implementation

object Akka extends FileReader {
  override def consume(path: Path): Result = {
    implicit val executionContext:  ExecutionContext  = ExecutionContext.global
    implicit val actorSystem:       ActorSystem       = ActorSystem()
    implicit val actorMaterializer: ActorMaterializer = ActorMaterializer()

    try {
      Await.result(
        FileIO
          .fromPath(path)
          .via(Framing.delimiter(ByteString("\n"),  512).map(_.utf8String))
          .runFold(LineMetricsAccumulator.empty)(_ addLine _)
          .map(_.asResult),
        Duration.Inf
      )
    } finally {
      actorMaterializer.shutdown()
      Await.result(actorSystem.terminate(), Duration.Inf)
      ()
    }
  }

  override def description: String = "Akka Streams";
}

Note: if you’re curious about the empty parens in the finally block, that’s a consequence of disabling value discarding to avoid breaking Scala’s type system.

Creation, transformation, and materialization of the stream are nicely decoupled, which makes modularization, abstraction, and testing much easier than any of the libraries we’ve considered up to this point.

Akka is very general, but what they do provide is generally very easy to work with – provided you stick to the provided DSL. You can implement custom stages when the public DSL falls short, but their internal API is fully mutable and difficult to get right. This is partially, but not completely, offset by an excellent testkit.

While you can go quite a ways without creating a custom stage, it’ll almost inevitably come up eventually, between the difficulty this represents and how fiddly it can be to properly close out the underlying Actor System, ergonomics comes up as a draw.

Safety 😀

The underlying implementation using an effectively untyped Actor system, but with the exception of having to provide one, and later shut it down, this abstraction almost never leaks.

The most common place for issues to arise is neglecting to make sure the actor system is shut down in all exit paths – it’ll prevent the JVM from exiting if it hits a path that was forgotten.

More positive aspects are that computation is completely deferred to when the stream is materialized, so the problems with invalid instances which can make Iterator difficult to work with are a non-issue. This also avoids the tendency to inappropriately memoize that scala.collection.immutable.Stream exhibits.

Performance

Without memory constraints, Akka Streams managed to land right between the Scala Standard Library and the first Java reference implementation, both in terms of speed and consistency.

Akka does not fare well for this use-case in a memory-constrained environment. This may be because the actor system permanently takes up a chunk of memory and lowers the amount of time between GC runs, though I haven’t profiled this aspect to be certain.

library	env	wall clock (mm:ss ± %)	% of best in env	% of best	% of reference	% change from local
Scala StdLib	local	00:36.643 ± 1.91 %	100.00 %	100.00 %	20.34 %	0.00 %
Akka Streams	local	00:55.586 ± 2.30 %	151.69 %	151.69 %	30.85 %	0.00 %
Scala StdLib	EC2	02:02.973 ± 8.83 %	100.00 %	335.59 %	68.26 %	235.59 %
Java StdLib	EC2	03:00.161 ± 23.98 %	146.50 %	491.66 %	100.00 %	131.71 %
Akka Streams	EC2	03:51.666 ± 4.68 %	188.39 %	632.21 %	128.59 %	316.77 %

Memory Usage

The memory usage for Akka Streams is both considerably larger and highly consistent, so I’m inclined to chalk up the difference to the overhead of the Actor System.

library	env	peak memory used (mb ± %)	% of best in env	% of best	% of reference
Java StdLib	EC2	328.89 ± 9.71 %	100.00 %	102.30 %	100.00 %
Scala StdLib	EC2	365.64 ± 0.06 %	111.17 %	113.73 %	111.17 %
Akka Streams	EC2	367.27 ± 0.56 %	111.67 %	114.23 %	111.67 %
Scala StdLib	local	916.20 ± 7.59 %	284.97 %	284.97 %	278.57 %
Akka Streams	local	1434.66 ± 0.42 %	446.23 %	446.23 %	436.21 %

Conclusion

Akka Streams are at their best when dealing with asynchronous transformations, where the push/pull of back-pressure can make a big difference in how much work is performed. Reading a file, line by line, and producing some aggregate information isn’t the best fit. Although it acquitted itself decently, there’s probably a lighter-weight library which might be a better choice, and it really struggles when memory is tight.

See in git repo

Up next: a library designed to be lighter