The Scala Standard Library offers a zero-dependency way of processing really, really large files that’s surprisingly reasonable.
Implementation
    import java.nio.file.Path
    import scala.io.{Codec, Source}

    object StdLib extends FileReader {
      override def consume(path: Path): Result = {
        // 4096 is the read buffer size handed to Source.fromFile
        val source = Source.fromFile(path.toFile, 4096)(Codec.UTF8)
        try {
          source
            .getLines()
            .foldLeft(LineMetricsAccumulator.empty)(_ addLine _)
            .asResult
        } finally source.close() // always release the file handle
      }

      override def description: String = "Scala StdLib"
    }
Ergonomics 😐
Overall, scala.io.Source provides a decent API for simpler processing tasks. It’s a little awkward because Scala doesn’t have syntactic sugar equivalent to Java’s try-with-resources construct, which would be more of an issue if Source.fromFile didn’t abstract away the nested readers.
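For readers on Scala 2.13 or later, the library helper scala.util.Using (a plain method, not syntax) can stand in for the try/finally. The following is a minimal sketch under that version assumption; `UsingSketch` and `countLines` are hypothetical names, and the line count is a stand-in for the real accumulator used in the implementation above.

```scala
import java.nio.file.Path
import scala.io.{Codec, Source}
import scala.util.Using

object UsingSketch {
  // Hypothetical stand-in for `consume`: only counts lines.
  // Using.resource closes the Source whether the body returns or throws.
  def countLines(path: Path): Long =
    Using.resource(Source.fromFile(path.toFile, 4096)(Codec.UTF8)) { source =>
      source.getLines().foldLeft(0L)((count, _) => count + 1)
    }
}
```

The Source still has to be fully consumed inside the block, so this tidies up the syntax without changing the lifespan constraint.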
Source#getLines produces a standard Iterator, so if you’re familiar with the Collections API, all those goodies are there and you’ll probably be able to avoid interacting with the low-level Iterator API. The flip side is that the programmer needs to be very careful: because Iterator is mutable, many of the methods in the Collections API invalidate the underlying instance.
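To make the invalidation concrete, here is a small sketch using an in-memory Source so it runs without a file. `IteratorPitfall` and `demo` are hypothetical names chosen for illustration.

```scala
import scala.io.Source

object IteratorPitfall {
  def demo(): (Int, List[String]) = {
    val lines = Source.fromString("a\nb\nc").getLines()

    // foldLeft walks the iterator to the end...
    val count = lines.foldLeft(0)((n, _) => n + 1)

    // ...so a second pass over the same instance sees nothing.
    val secondPass = lines.toList

    (count, secondPass)
  }
}
```

Here `demo()` returns a count of 3 and an empty list. The second traversal silently yields no lines rather than failing, which is what makes this kind of bug easy to miss.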
Finally, and (subjectively) the biggest pain point: the lifespan of this Source must be wrapped in some construct to ensure it closes properly. This really limits what can be done with Source in terms of asynchronous operations. The difficulty of decoupling the start, middle, and end stages also makes writing unit tests harder.
Safety 😕
A big missed opportunity is that the methods on the Iterator returned by Source#getLines don’t close their parent upon completion or failure. This means that this implementation, while intuitive, would leak open file descriptors:
    object ScalaStdLib extends FileReader {
      override def consume(path: Path): Result =
        Source
          .fromFile(path.toFile, 4096)(Codec.UTF8)
          .getLines()
          .foldLeft(LineMetricsAccumulator.empty)(_ addLine _)
          .asResult

      override def description: String = "Scala StdLib"
    }
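One way to approximate the missing behavior is a thin wrapper that closes the parent once the iterator is exhausted or throws. The sketch below is an illustration only: `ClosingIterator` and its `lines` helper are hypothetical and not part of the standard library.

```scala
import java.io.Closeable
import java.nio.file.Path
import scala.io.{Codec, Source}

// Hypothetical helper: delegates to `underlying` and closes `resource`
// on exhaustion or on the first failure.
final class ClosingIterator[A](underlying: Iterator[A], resource: Closeable)
    extends Iterator[A] {

  private var closed = false

  private def closeOnce(): Unit =
    if (!closed) {
      closed = true
      resource.close()
    }

  // Runs `step`, closing the resource before rethrowing any failure.
  private def guarded[B](step: => B): B =
    try step
    catch {
      case t: Throwable =>
        closeOnce()
        throw t
    }

  override def hasNext: Boolean =
    !closed && {
      val more = guarded(underlying.hasNext)
      if (!more) closeOnce() // exhausted: release the resource
      more
    }

  override def next(): A = guarded(underlying.next())
}

object ClosingIterator {
  // Lines of a file, with the Source closed once the last line is read.
  def lines(path: Path): Iterator[String] = {
    val source = Source.fromFile(path.toFile, 4096)(Codec.UTF8)
    new ClosingIterator(source.getLines(), source)
  }
}
```

This only helps when the iterator is drained or fails. A caller that abandons it early (for example after `take(10)`) still leaks the descriptor, so it is a convenience rather than a guarantee.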
Performance
The Standard Library had the best performance and was among the more consistent between runs. While the low-memory environment of the EC2 instance severely impacted performance, it still managed to outperform the other implementations.
library | env | wall clock (mm:ss ± %) | % of best in env | % of best | % of reference | % change from local |
---|---|---|---|---|---|---|
Scala StdLib | local | 00:36.643 ± 1.91 % | 100.00 % | 100.00 % | 20.34 % | 0.00 % |
Scala StdLib | EC2 | 02:02.973 ± 8.83 % | 100.00 % | 335.59 % | 68.26 % | 235.59 % |
Java StdLib | EC2 | 03:00.161 ± 23.98 % | 146.50 % | 491.66 % | 100.00 % | 131.71 % |
Memory Usage
The extra speed comes at a cost: when run locally, the peak memory usage was nearly three times that of the Java reference implementation, despite using a smaller buffer (4,096 bytes vs 8,192 bytes). When memory is constrained, it hangs out right at the upper limit of what it can use.
library | env | peak memory used (MB ± %) | % of best in env | % of best | % of reference |
---|---|---|---|---|---|
Java StdLib | EC2 | 328.89 ± 9.71 % | 100.00 % | 102.30 % | 100.00 % |
Scala StdLib | EC2 | 365.64 ± 0.06 % | 111.17 % | 113.73 % | 111.17 % |
Scala StdLib | local | 916.20 ± 7.59 % | 284.97 % | 284.97 % | 278.57 % |
Conclusion
The Scala Standard Library is a performant solution, but given its usability issues and how small the gap is between it and the nearest competing library, it’s probably best to leave this one for interview questions.
Up next: a better way?