Using Scala to Read Really, Really Large Files – Part 0: Introduction

 | 

Every development team has a set of standard practices and approaches to familiar problems. While this is generally a good way to avoid
choice paralysis, it can also lead to stagnation, so it’s good to occasionally reevaluate the standard strategies when encountering new techniques or information.

One of the familiar problems at Livongo is how to process exceptionally large files, which pops up in some surprising places. Our standard strategy has been to process these with Akka Streams or Apache Spark, depending on the shape of the file and what we need done with it. This has, by and large, worked quite well for us.

Even so, it’s healthy to occasionally revisit these standard approaches, and when Using Java to Read Really, Really Large Files popped up in my news feed, it presented an excellent opportunity to do exactly that. For a challenge to be a good benchmark, it needs to fulfill some minimal criteria:

  • ☑ Realistic
    This challenge is focused on record-oriented files, which matches the shape of majority of the files we have to process. This challenge would not be nearly as applicable to processing files containing serialized graphs, like JSON or XML.
  • ☑ Deterministic
    While the GC pauses and the realities of being run in a multi-tasking OS do inject some non-determinism, the file provided by the challenge is large enough that these shouldn’t be dominant factors.
  • ☑ Unbiased
    Because this challenge used real-world data, and was originally tackled in NodeJs and Java, it’s unlikely to be biased towards a particular implementation.

As a bonus, the original post included a link to the code and my curiosity was piqued by the chance to compare a Java implementation with our Scala implementations, to see how they held up in comparison.

In order to make this information as useful as possible, we’re structuring this as a series of posts so that it’s easy to get to the relevant information quickly. This post will serve as an introduction and table of contents. Links to the other articles can be found at the end of this post. All the code samples come from the git repository for this series.

The Challenge

The challenge is reproduced here for completeness, feel free to skip this section if you’ve already read the original article.

The requirements are to produce a report from the raw data provided by the US Federal Elections Commission (download link) containing the following information:

  1. The total number of lines in the file.
  2. The first, 432nd, and 43243rd names.
  3. How many donations were made in each month.
  4. The most common first name, and how often it occurs.

The file is line-oriented, and pipe (|) delimited. The donor name is in the 8th column, and the donation date is in the 5th.

The Java implementation is acting as a reference implementation, so the following behaviors will be retained:

  1. Months in different years count as separate months.
  2. Personal names are assumed to have this format:
    family, given middle1 middle2 ... middleN.
  3. Names without a comma are assumed to belong to businesses, and will be ignored.

Criteria

When looking for a library to tackle the problem of a multi-gigabyte file, there are certain qualities we’d like this library to have and we’ll use these to evaluate our implementations.

Ergonomics

The best libraries are easy to use, which is an unfortunately fuzzy metric, so we’re going to lean into this and rate it on a completely subjective scale:

Symbol Meaning
😀 Excellent
😐 Not great, and not terrible
😕 Disappointing
😞 Awful

Safety

If something goes wrong halfway through the file, how much work does it take to make sure the program either recovers or handles the error gracefully? How easy is it for the programmer to make a mistake?

As this is also very subjective, we’ll use the same fuzzy scale as ergonomics.

Performance

Most of the large files we have to process are done by background processes, and to get the most relevant information possible our performance data will be from just before the file is opened to just after the result is returned, in a fresh JVM for each trial.

This will be measured in milliseconds and aggregated as the minimum, maximum, average, standard deviation, and coefficent of variation. Because humans don’t compare tens of thousands of milliseconds all that well, the data will be formatted in minutes, seconds, and milliseconds.

The average ± coefficient of variation will be used as a summary metric for general comparisons in the results tables, the rest of the data will be available in the git repository associated with this post.

Memory Usage

Ideally we’d be able to gather several metrics related to memory usage, however for simplicity the one we’ll gather is peak memory usage. This should provide a stable data point for comparison.

As measuring is not 100% reliable (at least as far as I am aware), these values should be taken with a grain of salt. As the same methodology is being used for each implementation, the relative values should still be
useful.

This will be measured in bytes and aggregated as the minimum, maximum, average, standard deviation, and coefficent of variation. Because humans don’t compare hundreds of thousands of bytes all that well, the data will be formatted in megabytes.

The average ± coefficient of variation will be used as a summary metric for general comparisons in the results tables, the rest of the data will be available in the git repository associated with this post.

Author’s Note

Two measurements will be presented for every implementation except the Java reference implementation. Initially, the Java implementation had a few issues (see part 7 for details) which hurt its metrics.

The first round of benchmarks were run on a MacBook Pro, and were not memory constrained. Because the full run took about 6 hours, when it came time to rerun the benchmarks, an EC2 instance was spun up so they wouldn’t hog my development machine.

The EC2 instance ended up being memory constrained for most implementations, which provides and interesting perspective on how the performance of each implementation is affected by memory limits.

Rather than show a partial picture, I’ve decided to include both sets of measurements side-by-side, as well as the percent change of average runtime. The percent change in memory usage won’t be reported because, as almost all of them hit the memory cap, it wouldn’t be a useful metric.

The one exception to this is the Java implementation. While both sets of data will be included in the section devoted specifically to how our reference algorithm is implemented, at the end of each section only the EC2 version will be included. This is because the second version is the only version that’s fair to Java to use (due to the flaws in the first implementation).

Table of Contents

  1. Setup – Implementation details for the curious
  2. Standard Libraries – The Scala Collections Library + Java NIO
  3. better-files – Syntactic sugar
  4. Akka Streams – Reactive Streams library
  5. fs2-io – Functional Streams for Scala (IO helpers)
  6. FS2 (core) – Functional Streams for Scala (core library)
  7. Reference Implementation – Vanilla Java NIO
  8. Summary