Using Scala to Read Really, Really Large Files – Part 2: The Standard Libraries


The Scala Standard Library offers a zero-dependency way of processing really, really large files that’s surprisingly reasonable.

Implementation

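import java.nio.file.Path
import scala.io.{Codec, Source}

// FileReader, Result, and LineMetricsAccumulator are defined elsewhere in the project.
object StdLib extends FileReader {
  override def consume(path: Path): Result = {
    // 4096 is the read buffer size
    val source = Source.fromFile(path.toFile, 4096)(Codec.UTF8)
    try {
      source
        .getLines()
        .foldLeft(LineMetricsAccumulator.empty)(_ addLine _)
        .asResult
    } finally source.close() // always release the file handle
  }

  override def description: String = "Scala StdLib"
}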

Ergonomics 😐

Overall, scala.io.Source provides a decent API for simpler processing tasks. It’s a little awkward because Scala doesn’t have syntactic sugar equivalent to Java’s try-with-resources construct, which would be more of an issue if Source.fromFile didn’t abstract away the nested readers.
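Since Scala 2.13, the standard library does include scala.util.Using, which is a library helper rather than syntax but covers much of the same ground. The following is a minimal sketch, assuming Scala 2.13 or later; UsingSketch and countLines are illustrative names and are not part of the project.

```scala
import java.nio.file.Path
import scala.io.{Codec, Source}
import scala.util.Using

object UsingSketch {
  // Counts the lines in a file. Using.resource closes the Source
  // whether the body returns normally or throws.
  def countLines(path: Path): Long =
    Using.resource(Source.fromFile(path.toFile, 4096)(Codec.UTF8)) { source =>
      source.getLines().foldLeft(0L)((count, _) => count + 1)
    }
}
```

The fold has to finish inside the block: anything lazy that escapes it would be reading from a Source that has already been closed.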

Source#getLines produces a standard Iterator, so if you’re familiar with the Collections API, all those goodies are there and you’ll probably be able to avoid interacting with the low-level Iterator API. The flip side is that the programmer needs to be very careful: Iterator is mutable, and many of the methods in the Collections API invalidate the underlying instance.
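To make the invalidation concrete, here is a small sketch that uses an in-memory Iterator in place of a file. IteratorPitfall and its methods are illustrative names. The first method shows that asking for the size consumes the iterator; the second shows the usual workaround of gathering everything in a single pass.

```scala
object IteratorPitfall {
  // Returns (size, hasNext after calling size).
  def sizeThenCheck(): (Int, Boolean) = {
    val lines = Iterator("a", "b", "c")
    val n = lines.size   // walks the iterator to its end
    (n, lines.hasNext)   // nothing is left to read
  }

  // Single pass: collect the line count and the longest line length
  // in one fold instead of traversing twice.
  def countAndLongest(lines: Iterator[String]): (Int, Int) =
    lines.foldLeft((0, 0)) { case ((count, longest), line) =>
      (count + 1, math.max(longest, line.length))
    }
}
```

Copying the lines into a List would also allow multiple passes, but that defeats the purpose when the file is too large to hold in memory.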

Finally, and (subjectively) the biggest pain point: the lifespan of the Source must be wrapped in some construct to ensure it closes properly. This really limits what can be done with Source in terms of asynchronous operations. The difficulty of decoupling the start, middle, and end stages also makes unit tests harder to write.
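One way to soften the testing problem is to keep the middle stage as a plain function over Iterator[String] and lend it the lines from a wrapper that owns the Source. This is only a sketch of that loan-style arrangement; LoanPatternSketch, longestLine, and withLines are hypothetical names, not part of the project.

```scala
import java.nio.file.Path
import scala.io.{Codec, Source}

object LoanPatternSketch {
  // Middle stage: knows nothing about files, so a unit test can
  // feed it any Iterator[String].
  def longestLine(lines: Iterator[String]): Int =
    lines.foldLeft(0)((longest, line) => math.max(longest, line.length))

  // Start and end stages: open the Source, lend its lines to f,
  // and close the Source whether or not f throws.
  def withLines[A](path: Path)(f: Iterator[String] => A): A = {
    val source = Source.fromFile(path.toFile, 4096)(Codec.UTF8)
    try f(source.getLines())
    finally source.close()
  }
}
```

A call such as LoanPatternSketch.withLines(path)(LoanPatternSketch.longestLine) wires the stages together. The limitation described above still applies: f must finish consuming the lines before it returns, so this does not help with asynchronous consumers.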

Safety 😕

A big missed opportunity is that the methods on the Iterator returned by Source#getLines don’t close their parent upon completion or failure. This means that this implementation, while intuitive, would leak open file descriptors:

object ScalaStdLib extends FileReader {
  // The Source created here is never closed, so its file descriptor leaks.
  override def consume(path: Path): Result =
    Source
      .fromFile(path.toFile, 4096)(Codec.UTF8)
      .getLines()
      .foldLeft(LineMetricsAccumulator.empty)(_ addLine _)
      .asResult

  override def description: String = "Scala StdLib"
}
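For illustration, this is roughly what an iterator that closes its parent on completion or failure could look like. It is a hypothetical wrapper, not something the standard library provides; ClosingIteratorSketch, ClosingIterator, and closingLines are made-up names.

```scala
import java.nio.file.Path
import scala.io.{Codec, Source}

object ClosingIteratorSketch {
  // Hypothetical wrapper: runs `close` once, either when the underlying
  // iterator is exhausted or when it throws.
  final class ClosingIterator[A](underlying: Iterator[A], close: () => Unit)
      extends Iterator[A] {
    private var closed = false

    private def closeOnce(): Unit =
      if (!closed) { closed = true; close() }

    // Evaluates body; if it throws, closes the resource and rethrows.
    private def guarded[B](body: => B): B =
      try body
      catch { case e: Throwable => closeOnce(); throw e }

    override def hasNext: Boolean =
      !closed && {
        val more = guarded(underlying.hasNext)
        if (!more) closeOnce()
        more
      }

    override def next(): A =
      if (hasNext) guarded(underlying.next())
      else throw new NoSuchElementException("next on empty iterator")
  }

  // Assumed helper: the lines of a file, closing the Source when done.
  def closingLines(path: Path): Iterator[String] = {
    val source = Source.fromFile(path.toFile, 4096)(Codec.UTF8)
    new ClosingIterator(source.getLines(), () => source.close())
  }
}
```

Even this only covers iterators that are read to the end or that fail. A caller who stops early, for example after take(10), would still leak the descriptor, which is part of why an explicit try/finally remains the safer choice.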

Performance

The Standard Library had the best performance, and was one of the more consistent between runs. While the low-memory environment of the EC2 instance severely impacted performance, it still managed to out-perform the other implementations.

| library | env | wall clock (mm:ss ± %) | % of best in env | % of best | % of reference | % change from local |
|---|---|---|---|---|---|---|
| Scala StdLib | local | 00:36.643 ± 1.91 % | 100.00 % | 100.00 % | 20.34 % | 0.00 % |
| Scala StdLib | EC2 | 02:02.973 ± 8.83 % | 100.00 % | 335.59 % | 68.26 % | 235.59 % |
| Java StdLib | EC2 | 03:00.161 ± 23.98 % | 146.50 % | 491.66 % | 100.00 % | 131.71 % |

Memory Usage

The extra speed comes at a cost: when run locally, the peak memory usage was nearly three times that of the Java reference implementation, despite using a smaller buffer (4,096 bytes vs 8,192 bytes). When memory is constrained, it hangs out right at the upper limit of what it can use.

| library | env | peak memory used (MB ± %) | % of best in env | % of best | % of reference |
|---|---|---|---|---|---|
| Java StdLib | EC2 | 328.89 ± 9.71 % | 100.00 % | 102.30 % | 100.00 % |
| Scala StdLib | EC2 | 365.64 ± 0.06 % | 111.17 % | 113.73 % | 111.17 % |
| Scala StdLib | local | 916.20 ± 7.59 % | 284.97 % | 284.97 % | 278.57 % |

Conclusion

The Scala Standard Library is a performant solution, but given its usability issues and how small the gap is between it and the nearest competing library, it’s probably best to leave this one for interview questions.

See in git repo

Up next: a better way?