BOXES v0.10

In our previous lesson we created a serviceable text layout engine. It has many problems, but remember that our goal is not to create the best possible thing: this is an educational experience. The spit and polish will appear later on.

But there is a glaring problem: it breaks words in all the wrong places. Examples of this appear in almost every line of the output. So, how does one fix that?

The traditional answer (and the one we will be using) is hyphenation: breaking words across lines only in the correct places.

Instead of breaking anywhere, we will break only in the places where the rules of each language allow us to.

Just as happened with text shaping, we are lucky to live at a time when almost everything we need to do this right is already in place. In particular, we will use a library called Pyphen, mostly because I have already used it in another project.

Am I sure it's the best one? No. Do I know exactly how it does what it does? No. I know enough to make it work, and it works well enough, so for this stage in the life of this project that is more than enough. In fact, it takes the rules for word-breaking from dictionaries provided by an office suite, so it does about as good a job as the dictionary does. It even supports subtleties such as the differences between British and American English!

Here's an example of how it works:

import pyphen
dic = pyphen.Pyphen(lang='en_GB')
print('en_GB:', dic.inserted('dictionary', '-'))
dic = pyphen.Pyphen(lang='en_US')
print('en_US:', dic.inserted('dictionary', '-'))

en_GB: dic-tion-ary
en_US: dic-tio-nary

Keep in mind that this is not magic. If you feed it garbage, it will give you garbage.

dic = pyphen.Pyphen(lang='es_ES')
print('es_ES:', dic.inserted('dictionary', '-'))

es_ES: dic-tio-na-ry

Where is it proper to break a line?

  • On a newline character
  • On a space
  • On a breaking point as defined by Pyphen

One of those things is not like the others. We have boxes with newlines in them and we have boxes with spaces in them, but there are no boxes with breaking points in them.

But we can add them! There is a Unicode character for that:

SOFT HYPHEN (SHY)

The soft hyphen serves as an invisible marker used to specify a place in text where a hyphenated break is allowed without forcing a line break in an inconvenient place if the text is re-flowed. It becomes visible only after word wrapping at the end of a line.
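As a quick check of what this character is, here is a small sketch that uses only Python's standard library. The SHY constant and the sample word are just names chosen for this illustration.

```python
import unicodedata

SHY = "\xad"  # the same character as "\u00ad"

print(unicodedata.name(SHY))      # SOFT HYPHEN
print(unicodedata.category(SHY))  # Cf (a "format" character)

# Soft hyphens are real characters inside the string...
word = "dic" + SHY + "tio" + SHY + "nary"
print(len(word))                  # 12

# ...and removing them gives back the original word.
print(word.replace(SHY, "") == "dictionary")  # True
```

Because soft hyphens count as characters, each one will get its own box once the text is turned into boxes.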

So, if we insert them in all the right places, then we can use them to decide whether we are at a suitable breaking point. We will put this in a file called hyphen.py:

import pyphen

dic = pyphen.Pyphen(lang="en_US")


def insert_soft_hyphens(text, hyphen="\xad"):
    """Insert the hyphen at the breaking points given by the dictionary.

    '\\xad' is the Soft Hyphen (SHY) character
    """
    lines = []
    for line in text.splitlines():
        hyph_words = [
            dic.inserted(word, hyphen) for word in line.split()
        ]
        lines.append(" ".join(hyph_words))
    return "\n".join(lines)

from hyphen import insert_soft_hyphens

print(insert_soft_hyphens('Roses are red\nViolets are blue', '-'))

Ros-es are red
Vi-o-lets are blue
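One thing to keep in mind about the function above: because it uses split() and " ".join(), runs of several spaces collapse into one, and a trailing newline is dropped. That is fine for this lesson. If it ever matters, a possible variant (only a sketch, not what this lesson uses) splits on whitespace while keeping the separators. The names insert_soft_hyphens_keep_spaces and toy_hyphenate are made up for this example, and toy_hyphenate is a stand-in so that the sketch runs without any dictionary. With Pyphen you would pass dic.inserted instead.

```python
import re


def insert_soft_hyphens_keep_spaces(text, hyphenate, hyphen="\xad"):
    """Hyphenate each word but leave every run of whitespace untouched.

    `hyphenate` is any callable taking (word, hyphen).
    """
    # The capturing group makes re.split() return the separators too
    pieces = re.split(r"(\s+)", text)
    return "".join(
        piece if (not piece or piece.isspace()) else hyphenate(piece, hyphen)
        for piece in pieces
    )


def toy_hyphenate(word, hyphen):
    """Stand-in hyphenator: allows a break after every third letter."""
    return hyphen.join(word[i:i + 3] for i in range(0, len(word), 3))


result = insert_soft_hyphens_keep_spaces(
    "Roses  are\n\nred\n", toy_hyphenate, "-"
)
print(repr(result))
```

The double space, the empty line and the final newline all survive, which the split-and-join version would not preserve.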

So, with this code ready, we can get to work on implementing hyphenation support in our layout function.

First, this code is exactly as it was before, apart from the new import from hyphen.py:

from fonts import adjust_widths_by_letter
from hyphen import insert_soft_hyphens


class Box():

    def __init__(self, x=0, y=0, w=1, h=1, stretchy=False, letter="x"):
        """Accept arguments to define our box, and store them."""
        self.x = x
        self.y = y
        self.w = w
        self.h = h
        self.stretchy = stretchy
        self.letter = letter

    def __repr__(self):
        return 'Box(%s, %s, %s, %s, "%s")' % (
            self.x, self.y, self.w, self.h, self.letter
        )

We do need to make a small change to how we load our text, to add the hyphens:

p_and_p = open("pride-and-prejudice.txt").read()
p_and_p = insert_soft_hyphens(p_and_p)  # Insert invisible hyphens
text_boxes = []
for l in p_and_p:
    text_boxes.append(Box(letter=l, stretchy=l == " "))
adjust_widths_by_letter(text_boxes)

# A few pages all the same size
pages = [Box(i * 35, 0, 30, 50) for i in range(10)]

No changes in how we draw things.

import svgwrite


def draw_boxes(boxes, fname, size, hide_boxes=False):
    dwg = svgwrite.Drawing(fname, profile="full", size=size)
    # Draw the pages
    for page in pages:
        dwg.add(
            dwg.rect(
                insert=(f"{page.x}cm", f"{page.y}cm"),
                size=(f"{page.w}cm", f"{page.h}cm"),
                fill="lightblue",
            )
        )
    # Draw all the boxes
    for box in boxes:
        # The box color depends on its features
        color = "green" if box.stretchy else "red"
        # Make the colored boxes optional
        if not hide_boxes:
            dwg.add(
                dwg.rect(
                    insert=(f"{box.x}cm", f"{box.y}cm"),
                    size=(f"{box.w}cm", f"{box.h}cm"),
                    fill=color,
                )
            )
        # Display the letter in the box
        if box.letter:
            dwg.add(
                dwg.text(
                    box.letter,
                    insert=(f"{box.x}cm", f"{box.y + box.h}cm"),
                    font_size=f"{box.h}cm",
                    font_family="Arial",
                )
            )
    dwg.save()

And now our layout function. A first approach, which we will refine later, is to simply refuse to break a line unless we are at a "good" place to break it.

Then, we inject a box with a visible hyphen at the line break, and that's it.
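Before doing this with boxes, the same idea can be sketched on plain strings, counting characters instead of measuring widths. This is only an illustration of the rule described above, not the lesson's layout function. The name wrap_with_soft_hyphens and the width given in characters are assumptions made for this example.

```python
SHY = "\xad"


def wrap_with_soft_hyphens(text, width):
    """Greedy wrap that breaks only at newlines, spaces or soft hyphens."""
    lines = []
    line = ""  # visible characters placed on the current line so far
    for ch in text:
        if ch == "\n":
            # A newline always breaks the line
            lines.append(line)
            line = ""
        elif ch in (" ", SHY) and len(line) >= width:
            # Too far to the right AND a good place to break
            if ch == SHY:
                lines.append(line + "-")  # the hyphen becomes visible
            else:
                lines.append(line.rstrip(" "))
            line = ""
        elif ch == SHY:
            continue  # an unused soft hyphen stays invisible
        elif ch == " " and not line:
            continue  # collapse spaces at the left margin
        else:
            line += ch
    lines.append(line)
    return lines


wrapped = wrap_with_soft_hyphens("in\xadsert soft hy\xadphens", 6)
print(wrapped)  # ['insert', 'soft hy-', 'phens']
```

Notice that the second line is longer than the requested 6 characters: since we refuse to break anywhere else, a line keeps growing until the next good place shows up.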

Here is the code to create a box with a hyphen:

def hyphenbox():
    b = Box(letter="-")
    adjust_widths_by_letter([b])
    return b

And here, finally, is our layout function with hyphen support:

# We add a "separation" constant so you can see the boxes individually
separation = .05


def layout(_boxes):
    """Layout boxes along pages.

    Keep in mind that this function modifies the boxes themselves, so
    you should be very careful about trying to call layout() more than once
    on the same boxes.

    Specifically, some spaces will become 0-width and not stretchy.
    """

    # Because we modify the box list, we will work on a copy
    boxes = _boxes[:]
    # We start at page 0
    page = 0
    # The 1st box should be placed in the correct page
    previous = boxes.pop(0)
    previous.x = pages[page].x
    previous.y = pages[page].y
    row = []
    while boxes:
        # We take the new 1st box
        box = boxes.pop(0)
        # And put it next to the other
        box.x = previous.x + previous.w + separation
        # At the same vertical location
        box.y = previous.y

        # Handle breaking on newlines
        break_line = False
        # But if it's a newline
        if (box.letter == "\n"):
            break_line = True
            # Newlines take no horizontal space ever
            box.w = 0
            box.stretchy = False

        # Or if it's too far to the right, and is a
        # good place to break the line...
        elif (box.x + box.w) > (
            pages[page].x + pages[page].w
        ) and box.letter in (
            " ", "\xad"
        ):
            if box.letter == "\xad":
                # Add a visible hyphen in the row
                h_b = hyphenbox()
                h_b.x = previous.x + previous.w + separation
                h_b.y = previous.y
                _boxes.append(h_b)  # So it's drawn
                row.append(h_b)  # So it's justified
            break_line = True
            # We adjust the row
            # Remove all right-margin spaces
            while row[-1].letter == " ":
                row.pop()
            slack = (pages[page].x + pages[page].w) - (
                row[-1].x + row[-1].w
            )
            # Get a list of all the ones that are stretchy
            stretchies = [b for b in row if b.stretchy]
            if not stretchies:  # Nothing stretches, do as before.
                bump = slack / len(row)
                # The 1st box gets 0 bumps, the 2nd gets 1 and so on
                for i, b in enumerate(row):
                    b.x += bump * i
            else:
                bump = slack / len(stretchies)
                # Each stretchy gets wider
                for b in stretchies:
                    b.w += bump
                # And we put each thing next to the previous one
                for j, b in enumerate(row[1:], 1):
                    b.x = row[j - 1].x + row[j - 1].w + separation

        if break_line:
            # We start a new row
            row = []
            # We go all the way left and a little down
            box.x = pages[page].x
            box.y = previous.y + previous.h + separation

        # But if we go too far down
        if box.y + box.h > pages[page].y + pages[page].h:
            # We go to the next page
            page += 1
            # And put the box at the top-left
            box.x = pages[page].x
            box.y = pages[page].y

        # Put the box in the row
        row.append(box)

        # Collapse all left-margin space
        if all(b.letter == " " for b in row):
            box.w = 0
            box.stretchy = False
            box.x = pages[page].x

        previous = box


layout(text_boxes)
draw_boxes(
    text_boxes, "lesson10.svg", ("30cm", "50cm"), hide_boxes=True
)

lesson10.svg

And there in "proper-ty" you can see it in action. Of course this is a naïve implementation. What happens if you just can't break?

many_boxes = [Box(letter='a') for i in range(200)]
adjust_widths_by_letter(many_boxes)
layout(many_boxes)
draw_boxes(many_boxes, 'lesson10_lots_of_a.svg', ("35cm", "6cm"), hide_boxes=True)

lesson10_lots_of_a.svg

Since it can't break at all, it just goes on and on.

And there are other corner cases!

many_boxes = [Box(letter='a') for i in range(200)]
many_boxes[100] = Box(letter=' ', stretchy=True)
adjust_widths_by_letter(many_boxes)
layout(many_boxes)
draw_boxes(many_boxes, 'lesson10_one_break.svg', ("35cm", "6cm"), hide_boxes=True)

lesson10_one_break.svg

Because there is only one place to break the line, it then tries to wedge 100 letters "a" into a space with room for 54 (I counted!), and something interesting happens: the "slack" is negative!

Instead of stretching out an "underfilled" line, we are squeezing an "overfilled" one. Everything gets packed too tight, and the letters start overlapping one another.
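To see the sign flip with numbers, here is a simplified restatement of the slack calculation. The helper line_slack is made up for this sketch, and the letter width of 0.5 is an assumption: the real width comes from adjust_widths_by_letter and will be different.

```python
def line_slack(page_w, widths, separation=0.05):
    """Free space left on a line; negative when the line is overfull."""
    gaps = max(len(widths) - 1, 0)
    used = sum(widths) + separation * gaps
    return page_w - used


LETTER_W = 0.5  # ASSUMPTION: made-up width for the letter "a"
PAGE_W = 30     # the page width used in this lesson

for count in (54, 100):
    slack = line_slack(PAGE_W, [LETTER_W] * count)
    # With nothing stretchy in the row, layout() shifts box i by bump * i
    bump = slack / count
    print(f"{count} letters: slack={slack:.2f}, bump={bump:.3f}")
```

With 54 letters the slack is small and positive, so the boxes get spread out a little. With 100 letters the slack is negative, so the bump is negative too, and each box is pulled to the left on top of its neighbours.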

The lesson is that just because something works for the usual case doesn't mean it's done. Even in the case of words, it can happen that breaking points take a while to appear and our line becomes overfull.

We will tackle that problem next.

