Boxes v0.22

In the previous lesson, we reimplemented our line breaking algorithm and made it capable of choosing the best possible breaking point giving constraints on what characters are allowable as breaking points and page width.

If you want to get technical and improve the algorithm there is literature about it. The next big step would be not breaking lines independently of each other but to instead consider the paragraph to be the natural unit and try to keep the "tightness" of ourspacing consistent across it.

That change is what Knuth proposes in his historical paper Breaking Paragraphs into Lines ...which we are not going to ever consider here. It would be possible to tweak our code in that direction. But we are trying to learn something and our text layout engine is, really, a bit of a MacGuffin.

What I am trying to teach here is not how to do a text layout engine, but how to do a text layout engine. The important part is the how, not the what.

So, having duly noted that there is vast room for improvement in the algorithm, we will turn to one last pass of improvements in our code.

Here is our fill_row() as of lesson 10:

117def fill_row(boxes, page, separation):
118    """Fill a row with elements removed from boxes.
119
120    The elements put in the row should be a good fit for laying out on
121    page considering separation.
122    """
123
124    # Calculate initial breaking point
125    row = []
126    x = page.x
127    while boxes:
128        b = boxes.pop(0)
129        row.append(b)
130        b.x = x
131        if is_breaking(b, page):
132            break
133
134        x = x + b.w + separation
135    # Calculate badness for previous breaking points
136    badnesses = {}
137    for i in range(1, len(row)):
138        _row = row[:i + 1]
139        if _row[-1].letter in [" ", "\xad", "\n"]:
140            how_bad = badness(page.w, _row)
141            badnesses[how_bad] = _row
142    if badnesses:
143        # Find minimum badness
144        min_bad = min(badnesses.keys())
145        _row = badnesses[min_bad]
146        # Put leftover letters back in boxes
147        for b in row[:len(_row) - 1:-1]:
148            boxes.insert(0, b)
149    else:
150        _row = row
151    return _row

It's basically two loops:

  1. Calculate the 1st overfull breaking point
  2. See if any previous breaking point is better

We could make it a single loop. We would track each possible breaking point's badness as we reach it, and then when we are overfull pick the minimum badness, and so on.

That would be a speed improvement. An optimization.

Faster is better. But it's not the only thing that is better. It's not true that everything that is faster is better. The following things are also better:

  • Readable
  • Easy to change in the future
  • Easy to test
  • Easy to debug

And (in my opinion) making it a single loop hurts all of those things. There is a saying in development: "premature optimization is the root of all evil" which, like all absolutes, is not true, but it is mostly true.

I prefer an alternative version: "It's easier to make correct code faster than to make fast code correct."

So, let's work on making it correct. Then, maybe, if it's slow, we will make it fast.

Just for fun, I had this code typeset all of Pride and Prejudice. It took about 5 minutes. It also created a 72MB SVG file that took 20GB of RAM and 10 minutes to open using Inkscape.

For comparison, pdfLaTeX with the most basic template (no packages at all) took less than a second to produce a 720KB PDF file.

Does that mean our code is slow? Yes and no.

Yes, because we have evidence that a similar thing can be done much faster. No, because we are not Knuth, he took years to do it, and we have not optimized anything.

In fact, here's what Knuth had to say on the first day TeX produced a page:

Came in evening after sleeping most of day, to get computer at better time. Some day we will have personal computers and will live more normally.

— D. E. Knuth, The Errors of TeX, 14 March 1978

First: There is a whole thing in that code about _row and row that looks so bad and confusing. Just renaming things can make code better:

135    # Calculate badness for previous breaking points
136    result = row
137    badnesses = {}
138    for i in range(1, len(row)):
139        partial_row = row[:i + 1]
140        if partial_row[-1].letter in [" ", "\xad", "\n"]:
141            how_bad = badness(page.w, partial_row)
142            badnesses[how_bad] = partial_row
143    if badnesses:
144        # Find minimum badness
145        min_bad = min(badnesses.keys())
146        partial_row = badnesses[min_bad]
147        # Put leftover letters back in boxes
148        for b in row[:len(partial_row) - 1:-1]:
149            boxes.insert(0, b)
150        result = partial_row
151    return result

Another code smell is this hardcoded list:

139        if _row[-1].letter in [" ", "\xad", "\n"]:

That needs to be a constant. We should add '\n' to our list of BREAKING_CHARS and use it.

There is also this strange code:

146        # Put leftover letters back in boxes
147        for b in row[:len(_row) - 1:-1]:
148            boxes.insert(0, b)

Why is that code there? Because fill_row() is supposed to consume boxes and move the consumed elements into row and return that, and leave the unconsumed ones for the next iteration.

Since we are "overeating" we need to put those extra boxes back in the pot, and there is no equivalent to extend() that would let us re-insert them at the beginning of the boxes list.

The real problem is that we don't want a list, we want a deque. Choosing the right data structures for your data is important.

The deque has the advantage that it allows for fast pops and appends on both ends of the container.

So, if boxes were a deque then we could just use extendleft:

boxes.extendleft(reversed(row[len(partial_row):]))

In layout() instead of creating a copy of boxes we can create a deque:

154def layout(_boxes, _pages, separation):
155    """Layout boxes along pages."""
156
157    # We modify these lists, so use copies
158    boxes = deque(_boxes)

When you change a data structure, it will trigger a propagation of changes across your code. You can see what happened in our code diff.

Was it worth it? Dubious. While a deque is indeed the proper data structure, it's also not very well known, and the next person looking at the code will be surprised because it looks a lot like a list but is not a list.

We "fixed" ugly code by making other code slightly more complicated and surprising than before. That is a tradeoff we should be aware of. I will leave it there but we could just as easily not have done it.

Trade offs of that sort happen all the time. Every time you add a feature, or fix a bug, or refactor something, you are paying a cost, and you are getting a payoff. It's on you to get payoffs that offset your costs. You will suck at it for a long time, and in the end it's a subjective matter and nobody has the truth. Just don't stress too much over it.

We can fix more things. I have left you subtle hints in the code:

78    elif row[-1].letter == "\xad":
79        # This looks pretty bad, doesn't it?
80        hyphen = hyphenbox()
81        row[-1].letter = hyphen.letter
82        row[-1].w = hyphen.w
54    if slack < 0:
55        # Arbitrary fudge factor, negative slack is THIS much worse
56        badness *= -2

But let's just try to see if we have obvious bugs, instead. At one point, when we introduced the ability of specifying page sizes that exposed a lot of issues.

It looked like this:

$ python code/lesson2/boxes.py pride-and-prejudice.txt lesson2.svg --page-size=10x20

lesson2.svg

$ python code/lesson11/boxes.py pride-and-prejudice.txt lesson11.svg --page-size=10x20

lesson11.svg

That looks better! No overflows, no strangely dark smushed lines... but... look at the vertical spacing. Before, our rows were flush to the top of the page, and now there is more space there.

Looking back, that change seems to have happened a while ago, in lesson 8. That is interesting, because I honestly did not notice until now. Bugs will be unnoticed. Yes, even large bugs. Even very large bugs. It will happen to you.

Also, if we render with the normal page sizes, you can see that the bottom row on each page is actually partly out of the page!

$ python code/lesson11/boxes.py pride-and-prejudice.txt lesson11.1.svg

lesson11.1.svg

So, yes, we have an obvious bug. Let's fix it. Remember TDD from lesson 9? Let's create a test that exposes the bug.

 1import pytest
 2
 3import boxes
 4
 5lipsum = (
 6    "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
 7    " Nulla sollicitudin justo id libero pharetra, at vehicula nisi "
 8    "pharetra. Nullam sit amet porttitor arcu. Duis purus dui, luctus"
 9    " ut ante sed, varius rhoncus diam. Donec faucibus erat in venenatis"
10    " dignissim. Cras vitae faucibus dui, a feugiat elit. Fusce non egestas"
11    " velit, nec suscipit erat.\n\n"
12)
13
14
15def test_page_overflow(tmpdir):
16    """When laying down text, it should not extend below the page."""
17
18    # Create a normal text layout
19    pages = boxes.create_pages((30, 50))
20    text = lipsum * 20
21    inp = tmpdir.mkdir("sub").join("lipsum.txt")
22    inp.write(text)
23    text_boxes = boxes.create_text_boxes(inp)
24
25    boxes.layout(text_boxes, pages, 0.05)
26
27    max_y = max(b.y + b.h for b in text_boxes)
28    assert max_y < pages[0].h

It does indeed fail!

env PYTHONPATH=. pytest tests/test_layout.py
============================= test session starts ==============================
platform linux -- Python 3.6.2, pytest-3.5.0, py-1.5.3, pluggy-0.6.0
rootdir: lesson11, inifile:
collected 1 item

tests/test_layout.py F                                                   [100%]

=================================== FAILURES ===================================
______________________________ test_page_overflow ______________________________

tmpdir = local('/tmp/pytest-of-ralsina/pytest-25/test_page_overflow0')

    def test_page_overflow(tmpdir):
        """When laying down text, it should not extend below the page."""

        # Create a normal text layout
        pages = boxes.create_pages((30,50))
        text = lipsum * 20
        inp = tmpdir.mkdir('sub').join('lipsum.txt')
        inp.write(text)
        text_boxes = boxes.create_text_boxes(inp)

        boxes.layout(text_boxes, pages, 0.05)

        max_y = max(b.y + b.h for b in text_boxes)
>       assert max_y < pages[0].h
E       assert 50.349999999999966 < 50
E        +  where 50 = Box(0, 0, 30, 50, "x").h

tests/test_layout.py:28: AssertionError
=========================== 1 failed in 0.42 seconds ===========================

So, we are laying down text that goes below the lower edge of the page.

The fix is pretty simple, and it was just a silly bug. It looked like this:

175        # Put all the letters in the right vertical position
176        y = y + row[0].h + separation
177        for b in row:
178            b.y = y

And it should have looked like this:

175        # Put all the letters in the right vertical position
176        for b in row:
177            b.y = y
178        y = y + row[0].h + separation

Is it fixed?

$ python code/lesson11.1/boxes.py pride-and-prejudice.txt lesson11.2.svg

lesson11.2.svg

Yes, yes it is.

And here, after 11 lessons, is where we stop. We have a decent text layout engine here. Is it state of the art? Hell no. Is it fast? Probably not. Is it great? Nope.

But we understand it. And we understand how to fix it, and how to improve it. And it does something useful-looking! Do you know any other tools to typeset SVG files out of text? I don't!

This is as good a time as any to move on. Let's do a quick recap!


results matching ""

    No results matching ""