Using Scala to Read Really, Really Large Files – Part 7: The Reference Implementation


The Java implementation used as a reference, both for correctness and “how long would this take to do in Java”, is a lightly modified version of this implementation, taken from the git repo of the original article.

Implementation

This section is going to be considerably larger than in previous parts, since the code is less abstracted, but not as large as one might expect: most of the logic will be skipped, with only the interesting bits noted.

Overview

The basic logic is the same; the primary structural difference is that it's a mutable operation centered on a huge while loop. The method is getting a bit long by Java standards, though it's still within what I dealt with back when I worked primarily in Java.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// FileReader, Result, and DateBucket are project types from the accompanying repo.
public class JavaStdLib implements FileReader {
    private JavaStdLib() {
    }

    @Override
    public String description() {
        return "Java StdLib";
    }

    @Override
    public Result consume(Path path) {
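        // Precompile the delimiters once so the per-line splits below reuse them.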
        Pattern columnDelimiters = Pattern.compile("\\s*\\|\\s*");
        Pattern nameDelimiter = Pattern.compile(", ");
        try (BufferedReader reader = Files.newBufferedReader(path)) {
            String readLine;
            int lines = 0;

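            // Names on these 1-based line numbers are captured for the report.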
            Set<Integer> indexes = new HashSet<>();
            indexes.add(1);
            indexes.add(433);
            indexes.add(43244);

            Map<Integer, String> specialNames = new HashMap<>();
            Map<DateBucket, Integer> donationsPerYearMonth = new HashMap<>();
            Map<String, Integer> firstNameOccurrences = new HashMap<>();

            while ((readLine = reader.readLine()) != null) {
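                // One record per line: update the count, sampled names, and per-month tallies.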
                lines++;
                String[] columns = columnDelimiters.split(readLine, 9);
                String name = columns[7];
                if (indexes.contains(lines)) {
                    specialNames.put(lines - 1, name);
                }

                String[] nameParts = nameDelimiter.split(name, 3);
                if (nameParts.length > 1) {
                    String firstAndMiddle = nameParts[1].trim();
                    if (!firstAndMiddle.isEmpty()) {
                        String[] firstNameParts = firstAndMiddle.split(" ");
                        firstNameOccurrences.merge(firstNameParts[0].trim(), 1, (a, b) -> a + b);
                    }
                }

                String rawDate = columns[4];
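                // Slice year (chars 0-3) and month (chars 4-5) out of the raw date.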
                int month = Integer.parseInt(rawDate.substring(4, 6));
                int year = Integer.parseInt(rawDate.substring(0, 4));

                donationsPerYearMonth.merge(DateBucket.apply(year, month), 1, (a, b) -> a + b);
            }

            Map.Entry<String, Integer> mostCommonFirstName = Collections.max(
                    firstNameOccurrences.entrySet(),
                    Map.Entry.comparingByValue());

            return Result.fromJava(
                    lines,
                    specialNames,
                    donationsPerYearMonth,
                    mostCommonFirstName.getKey(),
                    mostCommonFirstName.getValue()
            );
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static final JavaStdLib implementation = new JavaStdLib();
}

Result.fromJava

It turns out that the Java compiler doesn't handle some Scala features gracefully, so a method on the Scala side was needed to handle the conversions. The collections involved are small enough that the conversion shouldn't affect the timing results. A sketch of what such a converter might look like follows.
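
This is only a sketch, assuming Scala 2.13's scala.jdk.CollectionConverters and guessing at the shapes of Result and DateBucket (the real definitions live in the repo):

import scala.jdk.CollectionConverters._

// Assumed shapes; the real Result and DateBucket are defined in the repo.
final case class DateBucket(year: Int, month: Int)

final case class Result(
    lines: Int,
    specialNames: Map[Int, String],
    donationsPerYearMonth: Map[DateBucket, Int],
    mostCommonFirstName: String,
    mostCommonFirstNameCount: Int
)

object Result {
  // Bridges Java collections and boxed integers into their immutable Scala
  // equivalents, so the Java implementation can build the shared Result.
  def fromJava(
      lines: Int,
      specialNames: java.util.Map[Integer, String],
      donationsPerYearMonth: java.util.Map[DateBucket, Integer],
      mostCommonFirstName: String,
      mostCommonFirstNameCount: Int
  ): Result =
    Result(
      lines,
      specialNames.asScala.map { case (k, v) => (k.intValue, v) }.toMap,
      donationsPerYearMonth.asScala.map { case (k, v) => (k, v.intValue) }.toMap,
      mostCommonFirstName,
      mostCommonFirstNameCount
    )
}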

Ergonomics 😕

These bits made me really wish this method were defined on Scala's Map. Well done, Java:

 donationsPerYearMonth.merge(DateBucket.apply(year, month), 1, (a, b) -> a + b);
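
For what it's worth, Scala 2.13's updatedWith on immutable Map covers similar ground, at the cost of a bit more ceremony. A minimal sketch, reusing the assumed DateBucket shape from above:

final case class DateBucket(year: Int, month: Int)

object Donations {
  // updatedWith is the closest standard-library analogue to Java's
  // Map.merge; it returns a new immutable map instead of mutating in place.
  def record(
      counts: Map[DateBucket, Int],
      bucket: DateBucket
  ): Map[DateBucket, Int] =
    counts.updatedWith(bucket) {
      case Some(n) => Some(n + 1)
      case None    => Some(1)
    }
}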

Beyond merge, the ergonomics weren't great. Particularly bad is that the logic handling the record layout is scattered through the body of the method, so changing the layout would be error-prone.

Safety 😞

Other than the compiler enforcing “don’t use an int when you mean a String”, there wasn’t much safety on offer. The whole thing is completely mutable, which makes it harder to reason about cleanly.

Performance

I really expected this one to match (or possibly beat) the Scala standard library. Functional programming and immutable data tend to carry a performance cost, one that's generally more than justified by other factors. The Java implementation is mutable and extremely simple, so both the longer average runtime and the greater variability between measurements came as a huge surprise.

This was such a surprise that I pulled the git repo for the original challenge and ran the pure Java version a couple of times manually as a sanity check. The results were in the 88-second range, so the automatically gathered results seem reliable.

Still, it seemed like such an outlier that I ran it by one of our newer Livongoans, Polina Spivak, whose Java chops were considerably less rusty than mine.

We were able to improve on the implementation somewhat, knocking about 2 minutes off the average runtime through various optimizations. The biggest lift came from switching from the split provided by the String class to the split provided by the Pattern class, which caches the regex compilation instead of redoing it on every line; a sketch of the difference follows.
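
Here is a minimal sketch of the two paths, written in Scala for consistency with the rest of the series (both end up calling the same JVM methods; the names are illustrative):

import java.util.regex.Pattern

object Splits {
  // Compiled once; every call below reuses the compiled pattern.
  private val columnDelimiters = Pattern.compile("""\s*\|\s*""")

  def cached(line: String): Array[String] =
    columnDelimiters.split(line, 9)

  // String#split recompiles a multi-character regex like this one on every
  // call, which is the cost the optimization above avoids.
  def uncached(line: String): Array[String] =
    line.split("""\s*\|\s*""", 9)
}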

The current implementation probably admits further optimizations, which could hopefully bring it closer to the times we saw from the Scala standard library.

library       env    wall clock (mm:ss ± %)  % of best in env  % of best  % of reference  % change from local
Scala StdLib  local  00:36.643 ±  1.91 %     100.00 %          100.00 %    20.34 %          0.00 %
Java StdLib   local  01:17.751 ±  4.42 %     212.18 %          212.18 %    43.16 %          0.00 %
Scala StdLib  EC2    02:02.973 ±  8.83 %     100.00 %          335.59 %    68.26 %        235.59 %
Java StdLib   EC2    03:00.161 ± 23.98 %     146.50 %          491.66 %   100.00 %        131.71 %

Memory Usage

The Java version ruled the memory usage metric, beating the next most frugal library by half. This isn't surprising, given the extra allocations involved in using immutable data. What is surprising is that the GC absorbed the Scala implementations' extra allocations so well that their more frequent GC cycles never handed the Java implementation a noticeable runtime edge.

library       env    peak memory used (MB ± %)  % of best in env  % of best  % of reference
Java StdLib   local  321.51 ± 7.93 %            100.00 %          100.00 %    97.75 %
Java StdLib   EC2    328.89 ± 9.71 %            100.00 %          102.30 %   100.00 %
Scala StdLib  EC2    365.64 ± 0.06 %            111.17 %          113.73 %   111.17 %
Scala StdLib  local  916.20 ± 7.59 %            284.97 %          284.97 %   278.57 %

Conclusion

If memory is extremely constrained, the Java implementation might be worth looking at as a last resort.

See in git repo

Up next: TL;DR