Boxes v0.21

In the previous lesson, we looped another loop in the process of iteratively improving our code. After we did a large refactor in lesson 7 we had to pay the price by fixing a couple of bugs we had introduced, and while doing it, we learned some new techniques.

That is pretty much what software development is. We make a push to get a feature implemented. We found a bug or two. We add tests and fix the bugs. We add another feature.

So now it's feature time. We want to choose our line breaks "better". But what does "better" mean in this context? A while ago we introduced a function called badness() to measure how bad a breaking point was. So, better could mean "has a smaller badness".

Without getting into the particulars of how we measure badness, this sounds reasonable, or at least makes intutitive sense to me:

  • Badness starts very high.
  • Badness decreases as we add text to the row.
  • At some point, badness starts increasing again.
  • Badness will then tend to grow forever as long as we add text.

Consider the following line breaks to see why this is the case. | indicates the right edge of the page, so breaking there would be ideal.

1The                            |              # Very bad
2The quick                      |              # Slightly less bad
3The quick brown                |              # Still bad
4The quick brown fox            |              # Not so bad
5The quick brown fox jumped     |              # A bit bad
6The quick brown fox jumped over|              # Perfect!
7The quick brown fox jumped over|the           # Pretty good
8The quick brown fox jumped over|the lazy      # Not so good
9The quick brown fox jumped over|the lazy dog. # Bad again

Adding more text to that row is not going to make badness decrease after that, is it? It's only going to get worse.

When trying to figure out a place to break a line, a local minimum of badness is what we want. Just calculate badness for a number of potential breaking places, and choose its minimum.

Let's consider an example we found in the previous lesson and see how its badness evolves over time. For this I wrote a tiny exploratory program (remember those from part 1?)

1import boxes
2text_boxes = boxes.create_text_boxes('line.txt')
3for i in range(1, len(text_boxes)):
4    text_boxes[i].x = text_boxes[i-1].x + text_boxes[i-1].w
5for i in range(50, len(text_boxes)):
6    row = text_boxes[:i+1]
7    print("{:.03f} | {}".format(
8        boxes.badness(25, row),
9        ''.join(x.letter for x in row)))

Some lines seem repeated because non-visible characters are added.

 12.819 | take de­light in vex­ing me. You have no com­pas­si
 22.567 | take de­light in vex­ing me. You have no com­pas­sio
 32.315 | take de­light in vex­ing me. You have no com­pas­sion
 41.944 | take de­light in vex­ing me. You have no com­pas­sion 
 51.832 | take de­light in vex­ing me. You have no com­pas­sion f
 61.608 | take de­light in vex­ing me. You have no com­pas­sion fo
 71.476 | take de­light in vex­ing me. You have no com­pas­sion for
 81.226 | take de­light in vex­ing me. You have no com­pas­sion for 
 90.931 | take de­light in vex­ing me. You have no com­pas­sion for m
100.748 | take de­light in vex­ing me. You have no com­pas­sion for my
110.589 | take de­light in vex­ing me. You have no com­pas­sion for my 
120.406 | take de­light in vex­ing me. You have no com­pas­sion for my p
130.223 | take de­light in vex­ing me. You have no com­pas­sion for my po
140.040 | take de­light in vex­ing me. You have no com­pas­sion for my poo
150.137 | take de­light in vex­ing me. You have no com­pas­sion for my poor
160.293 | take de­light in vex­ing me. You have no com­pas­sion for my poor 
170.628 | take de­light in vex­ing me. You have no com­pas­sion for my poor n
180.963 | take de­light in vex­ing me. You have no com­pas­sion for my poor ne
191.162 | take de­light in vex­ing me. You have no com­pas­sion for my poor ner
201.466 | take de­light in vex­ing me. You have no com­pas­sion for my poor nerv
211.801 | take de­light in vex­ing me. You have no com­pas­sion for my poor nerve
222.105 | take de­light in vex­ing me. You have no com­pas­sion for my poor nerves
232.272 | take de­light in vex­ing me. You have no com­pas­sion for my poor nerves.
242.471 | take de­light in vex­ing me. You have no com­pas­sion for my poor nerves.”
252.471 | take de­light in vex­ing me. You have no com­pas­sion for my poor nerves.”

As you can see the minimum is reached in line 15. However, that is not a good place to break a line, so let's iterate and filter this by showing only good breaking points:

 1import boxes
 2
 3text_boxes = boxes.create_text_boxes("line.txt")
 4for i in range(1, len(text_boxes)):
 5    text_boxes[i].x = text_boxes[i - 1].x + text_boxes[i - 1].w
 6for i in range(30, len(text_boxes)):
 7    row = text_boxes[:i + 1]
 8    if row[-1].letter in [" ", "\xad", "\n"]:
 9        print(
10            "{:.03f} | {}".format(
11                boxes.badness(25, row), "".join(x.letter for x in row)
12            )
13        )

And here is the output:

 18.653 | take de­light in vex­ing me. You 
 26.135 | take de­light in vex­ing me. You have 
 34.732 | take de­light in vex­ing me. You have no 
 43.882 | take de­light in vex­ing me. You have no com­
 53.150 | take de­light in vex­ing me. You have no com­pas­
 61.944 | take de­light in vex­ing me. You have no com­pas­sion 
 71.226 | take de­light in vex­ing me. You have no com­pas­sion for 
 80.589 | take de­light in vex­ing me. You have no com­pas­sion for my 
 90.293 | take de­light in vex­ing me. You have no com­pas­sion for my poor 
102.471 | take de­light in vex­ing me. You have no com­pas­sion for my poor nerves.”

Looks good to me. Of course, the "badness" algorithm is just guesswork, but it appears to have the general features of what we want, so it's good enough for a first approach.

How could we use badness() to improve our line breaking algorithm, which is implemented in fill_row()? First, let's look at what we are doing now:

117def fill_row(boxes, page, separation):
118    """Fill a row with elements removed from boxes.
119
120    The elements put in the row should be a good fit for laying out on
121    page considering separation.
122    """
123
124    row = []
125    x = page.x
126    while boxes:
127        b = boxes.pop(0)
128        row.append(b)
129        b.x = x
130        if is_breaking(b, page):
131            break
132
133        x = x + b.w + separation
134    return row

That is very simple. It's going to become slightly more complex. Here is a first approach:

117def fill_row(boxes, page, separation):
118    """Fill a row with elements removed from boxes.
119
120    The elements put in the row should be a good fit for laying out on
121    page considering separation.
122    """
123
124    # Calculate initial breaking point
125    row = []
126    x = page.x
127    while boxes:
128        b = boxes.pop(0)
129        row.append(b)
130        b.x = x
131        if is_breaking(b, page):
132            break
133
134        x = x + b.w + separation
135    # Calculate badness for previous breaking points
136    badnesses = {}
137    for i in range(1, len(row)):
138        _row = row[:i + 1]
139        if _row[-1].letter in [" ", "\xad", "\n"]:
140            how_bad = badness(page.w, _row)
141            badnesses[how_bad] = _row
142    if badnesses:
143        # Find minimum badness
144        min_bad = min(badnesses.keys())
145        _row = badnesses[min_bad]
146        # Put leftover letters back in boxes
147        for b in row[:len(_row) - 1:-1]:
148            boxes.insert(0, b)
149    else:
150        _row = row
151    return _row

Does it work? Yes, quite nicely! Compare: before...

lesson9.3.svg

And after...

lesson10.svg

Faced with an overfull line, our algorithm backtracks and breaks that line earlier, while underfull, instead of smushing it.

However I am very, very unhappy with that code.

  • All the _row / row stuff is confusing.
  • The boxes.insert() loop is icky but necessary because otherwise those boxes are already popped and lost.
  • We have a magic literal! [" ", "\xad", "\n"]

So, time for another round of refactoring, redefining our internal interfaces and trying to make pieces fit more smoothly.

But before ... Our tests are passing! That's bad!

Why? Because it means our tests don't care where we break the line, or at least they are not good enough to tell apart the two cases we just saw.

So, they need improving. Here is the new test:

34def test_fill_row_backtrack():
35    letters = [boxes.Box(letter="x", x=i, w=1) for i in range(100)]
36    for x in (5, 25, 40):
37        letters[x].letter = " "
38        letters[x].stretchy = True
39    page = boxes.Box(x=0, y=0, w=30, h=30)
40    row = boxes.fill_row(letters, page, 0)
41
42    # Because of backtracking, it breaks underfull
43    assert len(row) == 26
44    assert len(letters) == 74
45    assert row[-1].letter == " "

And remember: if you change things and the tests don't break, either you are missing tests, or the change is not meaningful.

Finally: a quick visual check that we are not breaking something else:

python code/lesson10/boxes.py ./pride-and-prejudice.txt lesson10.1.svg

lesson10.1.svg

With a clearer conscience, we can move onto fixing that ugly function.


results matching ""

    No results matching ""