Using Scala to Read Really, Really Large Files – Part 3: better-files

better-files is a dependency-free pragmatic thin Scala wrapper around Java NIO.

Basically: scala.io.Source, done right. It’s simple enough that the user manual fits on the GitHub landing page.

Implementation

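import java.nio.file.Path
import better.files._ // brings the toScala extension on java.io.File into scope

object BetterFiles extends FileReader {
  // lineIterator closes the file itself once it is exhausted or throws
  override def consume(path: Path): Result =
    path.toFile.toScala.lineIterator.foldLeft(LineMetricsAccumulator.empty)(_ addLine _).asResult

  override def description: String = "better-files"
}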

One of the really nice parts of this library is that the various iterators it creates close themselves when the end of the file is reached or an exception occurs. This makes the code much simpler to write, and helps lessen the creation/cleanup coupling issues that crop up when using the standard library version.
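
To make the self-closing behaviour concrete, here is a minimal sketch of how such an iterator could be built using only the standard library. It is not better-files’ actual implementation: `SelfClosingIterator` and `SelfClosingDemo` are hypothetical names invented for this illustration, and the "resource" is a stand-in that only counts how often it is closed.

```scala
import java.io.Closeable

// Hypothetical wrapper for illustration only; better-files' real implementation may differ.
final class SelfClosingIterator[A](underlying: Iterator[A], resource: Closeable)
    extends Iterator[A] {
  private var closed = false

  // Guard so the resource is released at most once.
  private def closeOnce(): Unit =
    if (!closed) { closed = true; resource.close() }

  override def hasNext: Boolean = {
    val more =
      try underlying.hasNext
      catch { case e: Throwable => closeOnce(); throw e } // failure: release, then rethrow
    if (!more) closeOnce() // end of input reached: release the resource
    more
  }

  override def next(): A =
    try underlying.next()
    catch { case e: Throwable => closeOnce(); throw e }
}

object SelfClosingDemo {
  // Returns (elements seen, number of close() calls) after fully consuming the iterator.
  def run(): (Int, Int) = {
    var closes = 0
    val resource = new Closeable { override def close(): Unit = closes += 1 }
    val lines = new SelfClosingIterator(Iterator("a", "b", "c"), resource)
    val seen = lines.foldLeft(0)((count, _) => count + 1)
    (seen, closes)
  }
}
```

`SelfClosingDemo.run()` returns `(3, 1)`: the fold drains the iterator, and the final `hasNext` call closes the resource exactly once. Note that nothing in this sketch closes the resource if the caller stops iterating early.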

Safety 😕

Unfortunately, the library’s abstractions leak a bit – partially because the difference between a self-closing iterator and a vanilla iterator isn’t represented in the type system. This means that it’d be easy to mix them up when migrating and forget to close a resource.

The other big issue is that it’s not intuitive that partial iteration doesn’t close the underlying resource. This is unfortunately something of a necessity, and comes down to the issues inherent in building an API over a mutable resource like Iterator while exposing that resource.

For example, which iterator should close the file here: someEvens or someOdds?

val (evens, odds) =
  path.toFile.toScala
      .lineIterator
      .map(_.toInt)
      .partition(_ % 2 == 0)

val someEvens = evens.take(5)
val someOdds  = odds.take(5)

It’s not really a choice the library can make without out-of-band knowledge of the file and the program, so it makes sense that this was punted to the user. Unfortunately, the types don’t reflect this, so it’s easy to lose track of which iterators close the underlying file and which don’t.
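
One way the user can take on that responsibility is to scope the file explicitly and materialise the partial results before the scope ends. The sketch below does this with the Scala Standard Library rather than better-files, and assumes Scala 2.13 or later for `scala.util.Using`; `PartialReadDemo`, `firstEvensAndOdds`, and the temporary file of numbers are all made up for the illustration.

```scala
import java.nio.file.{Files, Path}
import scala.io.Source
import scala.util.Using

object PartialReadDemo {
  // Reads at most `n` even and `n` odd numbers; the file is closed when the block exits,
  // regardless of how much of it was consumed.
  def firstEvensAndOdds(path: Path, n: Int): (List[Int], List[Int]) =
    Using.resource(Source.fromFile(path.toFile)) { source =>
      val (evens, odds) = source.getLines().map(_.toInt).partition(_ % 2 == 0)
      // Materialise inside the block: the iterators are unusable once the source is closed.
      (evens.take(n).toList, odds.take(n).toList)
    }

  // Writes the numbers 1 to 100 to a temporary file (one per line) and reads part of it back.
  def run(): (List[Int], List[Int]) = {
    val path = Files.createTempFile("numbers", ".txt")
    try {
      Files.write(path, (1 to 100).mkString("\n").getBytes("UTF-8"))
      firstEvensAndOdds(path, 5)
    } finally Files.delete(path)
  }
}
```

`PartialReadDemo.run()` returns `(List(2, 4, 6, 8, 10), List(1, 3, 5, 7, 9))`. The trade-off is that the results have to be strict collections: returning either iterator from the block would hand the caller a view over an already-closed file.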

Performance

Unsurprisingly, the performance characteristics of better-files are nearly identical to those of the Scala Standard Library: it finished within 0.5 % of it locally and within 1.3 % on EC2, both inside the reported run-to-run variation. Beyond that, everything already said about the Scala Standard Library in part 2 applies here too.

library	env	wall clock (mm:ss ± %)	% of best in env	% of best	% of reference	% change from local
Scala StdLib	local	00:36.643 ± 1.91 %	100.00 %	100.00 %	20.34 %	0.00 %
better-files	local	00:36.818 ± 2.46 %	100.48 %	100.48 %	20.44 %	0.00 %
Scala StdLib	EC2	02:02.973 ± 8.83 %	100.00 %	335.59 %	68.26 %	235.59 %
better-files	EC2	02:04.564 ± 3.09 %	101.29 %	339.93 %	69.14 %	238.32 %
Java StdLib	EC2	03:00.161 ± 23.98 %	146.50 %	491.66 %	100.00 %	131.71 %

Memory Usage

Memory usage was also a very close match to the Scala Standard Library: peak usage differed by less than 0.5 % in both environments. Turns out it’s a very thin wrapper.

library	env	peak memory used (MB ± %)	% of best in env	% of best	% of reference
Java StdLib	EC2	328.89 ± 9.71 %	100.00 %	102.30 %	100.00 %
better-files	EC2	365.59 ± 0.07 %	111.16 %	113.71 %	111.16 %
Scala StdLib	EC2	365.64 ± 0.06 %	111.17 %	113.73 %	111.17 %
Scala StdLib	local	916.20 ± 7.59 %	284.97 %	284.97 %	278.57 %
better-files	local	920.19 ± 9.03 %	286.21 %	286.21 %	279.79 %

Conclusion

For simple transformations, the usability boost over the Scala Standard Library makes such a difference that there isn’t really any reason not to use Better Files. For more complicated transformations, it’s probably easier to use one of the more expressive libraries to avoid the safety issues.

See in git repo

Up next: a very expressive library