Using Scala to Read Really, Really Large Files – Part 3: better-files


better-files is a dependency-free pragmatic thin Scala wrapper around Java NIO

README.md

Basically: scala.io.Source, done right. It’s simple enough that the user manual fits on the GitHub landing page.

Implementation

object BetterFiles extends FileReader {
  override def consume(path: Path): Result =
    path.toFile.toScala.lineIterator.foldLeft(LineMetricsAccumulator.empty)(_ addLine _).asResult

  override def description: String = "better-files"
}

Ergonomics 😀

One of the really nice parts of this library is that the various iterators it creates close themselves when the end of the file is reached or an exception occurs. This makes the code much simpler to write, and helps lessen the creation/cleanup coupling issues that crop up when using the standard library version.
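To make the "closes itself" behaviour concrete, here is a minimal sketch of the idea using only the JDK and the Scala standard library. It is not how better-files implements it; SelfClosingLines and SelfClosingDemo are hypothetical names invented for this illustration.

```scala
import java.io.BufferedReader
import java.nio.charset.StandardCharsets
import java.nio.file.Files

// Hypothetical helper (not part of better-files): an Iterator[String] that
// closes its reader when the file is exhausted or when a read throws.
final class SelfClosingLines(reader: BufferedReader) extends Iterator[String] {
  private var nextLine: String = null
  private var fetched = false
  private var closed = false

  def isClosed: Boolean = closed

  private def close(): Unit = if (!closed) { closed = true; reader.close() }

  override def hasNext: Boolean = {
    if (!fetched && !closed) {
      nextLine =
        try reader.readLine()
        catch { case e: Throwable => close(); throw e } // cleanup on failure
      fetched = true
      if (nextLine == null) close() // cleanup on end of file
    }
    nextLine != null
  }

  override def next(): String = {
    if (!hasNext) throw new NoSuchElementException("end of file")
    val line = nextLine
    nextLine = null
    fetched = false
    line
  }
}

object SelfClosingDemo {
  // Counts the lines of a three-line temp file and reports whether the
  // reader was closed by the time iteration finished.
  def run(): (Int, Boolean) = {
    val path = Files.createTempFile("lines", ".txt")
    Files.write(path, "a\nb\nc\n".getBytes(StandardCharsets.UTF_8))
    val lines = new SelfClosingLines(Files.newBufferedReader(path, StandardCharsets.UTF_8))
    val count = lines.foldLeft(0)((n, _) => n + 1) // full traversal
    Files.delete(path)
    (count, lines.isClosed)
  }
}
```

The caller never mentions close: a full traversal such as foldLeft is enough to release the file, which is what keeps the consume implementation above down to a single expression.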

Safety 😕

Unfortunately, the abstraction leaks a bit, partly because the difference between a self-closing iterator and a vanilla Iterator isn’t represented in the type system. Both have the same static type, so it’d be easy to mix them up when migrating and forget to close a resource.

The other big issue is that it’s not intuitive that partial iteration doesn’t close the underlying resource. This is unfortunately something of a necessity, and comes down to the issues inherent in building an API over a mutable resource like Iterator while exposing that resource.

For example, which iterator should close the file here: someEvens or someOdds?

val (evens, odds) =
  path.toFile.toScala
      .lineIterator
      .map(_.toInt)
      .partition(_ % 2 == 0)

val someEvens = evens.take(5)
val someOdds  = odds.take(5)

It’s not really a choice that can be made without out-of-band knowledge of the file and program, so it makes sense that this was punted to the user. Unfortunately, the types don’t reflect this, so it’s easy to lose track of which iterators close the underlying file and which don’t.
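One way to sidestep the question is to give the file a single explicit owner and force every partial result before that owner lets go. The sketch below assumes Scala 2.13 or later and uses java.nio's Files.lines with scala.util.Using instead of better-files; PartitionedRead, firstEvensAndOdds and PartitionedReadDemo are hypothetical names, and the onClose hook exists only so the demo can observe the close.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import java.util.stream.{Stream => JStream}
import scala.jdk.CollectionConverters._
import scala.util.Using

object PartitionedRead {
  // Takes at most `n` even and `n` odd integers (one per line) and closes the
  // stream afterwards, no matter how much of it was actually consumed.
  def firstEvensAndOdds(lines: => JStream[String], n: Int): (List[Int], List[Int]) =
    Using.resource(lines) { stream =>
      val (evens, odds) =
        stream.iterator().asScala.map(_.toInt).partition(_ % 2 == 0)
      // Materialise both sides while the stream is still open.
      (evens.take(n).toList, odds.take(n).toList)
    }
}

object PartitionedReadDemo {
  def run(): (List[Int], List[Int], Boolean) = {
    val path = Files.createTempFile("numbers", ".txt")
    Files.write(path, (1 to 100).mkString("\n").getBytes(StandardCharsets.UTF_8))
    var closed = false
    val (evens, odds) = PartitionedRead.firstEvensAndOdds(
      // The onClose hook flips the flag so the demo can check the close happened.
      Files.lines(path, StandardCharsets.UTF_8).onClose(() => closed = true),
      5
    )
    Files.delete(path)
    (evens, odds, closed)
  }
}
```

The trade-off is that neither iterator escapes the block, so the results have to be strict collections. For "the first five of each" that is cheap; for a lazy pipeline over the whole file it defeats the purpose, which is the gap the more expressive libraries aim to fill.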

Performance

Unsurprisingly, the performance characteristics of Better Files are nearly identical to those of the Scala Standard Library: the two are within about 1.3 % of each other in both environments. Beyond that, there’s not much to add to what was already said about the Scala Standard Library in part 2.

| library | env | wall clock (mm:ss ± %) | % of best in env | % of best | % of reference | % change from local |
| --- | --- | --- | --- | --- | --- | --- |
| Scala StdLib | local | 00:36.643 ± 1.91 % | 100.00 % | 100.00 % | 20.34 % | 0.00 % |
| better-files | local | 00:36.818 ± 2.46 % | 100.48 % | 100.48 % | 20.44 % | 0.00 % |
| Scala StdLib | EC2 | 02:02.973 ± 8.83 % | 100.00 % | 335.59 % | 68.26 % | 235.59 % |
| better-files | EC2 | 02:04.564 ± 3.09 % | 101.29 % | 339.93 % | 69.14 % | 238.32 % |
| Java StdLib | EC2 | 03:00.161 ± 23.98 % | 146.50 % | 491.66 % | 100.00 % | 131.71 % |

Memory Usage

Memory usage was also a very close match to the Standard Library. Turns out it’s a very thin wrapper.

| library | env | peak memory used (mb ± %) | % of best in env | % of best | % of reference |
| --- | --- | --- | --- | --- | --- |
| Java StdLib | EC2 | 328.89 ± 9.71 % | 100.00 % | 102.30 % | 100.00 % |
| better-files | EC2 | 365.59 ± 0.07 % | 111.16 % | 113.71 % | 111.16 % |
| Scala StdLib | EC2 | 365.64 ± 0.06 % | 111.17 % | 113.73 % | 111.17 % |
| Scala StdLib | local | 916.20 ± 7.59 % | 284.97 % | 284.97 % | 278.57 % |
| better-files | local | 920.19 ± 9.03 % | 286.21 % | 286.21 % | 279.79 % |

Conclusion

For simple transformations, the usability boost over the Scala Standard Library makes such a difference that there isn’t really any reason not to use Better Files. For more complicated transformations, it’s probably easier to use one of the more expressive libraries to avoid the safety issues.

See in git repo

Up next: a very expressive library