Saturday, February 16, 2008

Hard to Soft Formatting

Sometimes I have a document that uses manual line breaks (hard formatting) to limit each line to a fixed number of characters, and I want to change it to a more word-processor-friendly format that uses paragraph breaks to delimit paragraphs, not lines.

This recent example is from a very decent walkthrough for the PS2 game "Hulk". It has text like this, where each line is no more than 80 characters long.


The Hulk is one of my most favorite comic book characters ever.  I remember 
watching the 80's series when I was younger and collecting a few of his 
comics.  But enough about that, lets get to the game.  The Hulk (GC, PS2, X-
box) has got to be the best Hulk game ever made!  Finally we have a worthy 
successor to The Incredible Hulk (Genesis, SNES).  The Hulk is the first game 
that gives you the impression that you actually are the Hulk.  From smashing 
down doorways to making your own door, The Hulk involves a path of endless 
destruction from the beginning to the end.

I hope this FAQ will answer any questions that you might have about The Hulk.  
I have tried to make the FAQ as detailed as possible.  I have stayed away from 
anything dealing with the storyline or cutscenes to avoid spoilers, but I 
cannot guarantee that this FAQ is totally spoiler-free.  So be sure to use the 
FAQ as a last resort if you are playing through for the first time. 

When I paste this into MS Word, every line ends with a paragraph break. I want to format the document so that paragraph breaks appear only at the end of logical paragraphs of text, not at the end of each line. This way, I can turn a 40-page document into 8 pages of monospaced text.

  1. Copy from the web page and paste into the Word document as unformatted text.
  2. Apply wicked formatting: 8 pt Courier New, 66% character spacing on an A4 page with two columns 0.5 cm apart, with top, bottom, left and right margins of 1 cm.
  3. Get rid of unnecessary spaces. Find ("  " - two spaces) and replace with (" " - one space). Note: don't use this if the document has useful tables or ASCII art that uses space for layout.
  4. Get rid of more unnecessary spaces before paragraph marks i.e. at the end of lines. Find using wildcards (" [^13]" - space then a paragraph mark) and replace with ("^p") - a paragraph mark.
  5. Get rid of unnecessary paragraph marks. Find ("^p^p" - two paragraph marks) and replace with ("^p" - one paragraph mark).
  6. Get rid of paragraph marks that are used only to delimit lines. The assumption is that when a line ends with a letter and the next line begins with a letter, the paragraph mark between them can safely be replaced with a space, giving "letter letter". Find using wildcards ("([A-Za-z])[^13]([A-Za-z])") and replace with ("\1 \2"). Since I can't really tell if a period at the end of a line indicates the end of a sentence or the end of a paragraph, I leave those alone.
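
For anyone who would rather script this outside Word, here is a minimal Python sketch of steps 3 to 6. It is not part of the Word workflow above. The function name unwrap_hard_breaks is made up for illustration, and the sketch assumes the text uses "\n" line endings where Word would show paragraph marks. One difference: a single Replace All of two spaces with one space may need repeating on long runs of spaces, while the regular expression below collapses a whole run in one go.

```python
import re


def unwrap_hard_breaks(text: str) -> str:
    """Hypothetical helper mirroring steps 3 to 6 on plain text."""
    # Step 3: collapse runs of two or more spaces into a single space.
    text = re.sub(r" {2,}", " ", text)
    # Step 4: remove spaces that sit just before a line break.
    text = re.sub(r" +\n", "\n", text)
    # Step 5: replace each pair of line breaks with one (a single pass).
    text = text.replace("\n\n", "\n")
    # Step 6: a break between two letters only delimits a line,
    # so swap it for a space and keep both letters via back references.
    text = re.sub(r"([A-Za-z])\n([A-Za-z])", r"\1 \2", text)
    return text


sample = "I remember \nwatching the  series.\n\nI hope this FAQ\nhelps."
result = unwrap_hard_breaks(sample)
print(result)
```

As in step 6, a break that follows a period is left alone, so the two sample paragraphs stay on separate lines.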

The above text will now take up substantially fewer column centimetres.

This demonstrates the following.

  • Word's Find and Replace supports a regular-expression-like syntax when you tick "Use wildcards" in the Find and Replace dialog.
  • To find a paragraph mark when you use wildcards, use the control code "[^13]" in the "Find" field. In the "Replace" field, you will still use "^p".
  • Character ranges can be defined between square brackets, such as "[A-Za-z]".
  • Round brackets - "()" - can be used to mark wildcard sequences in the "Find" field, and back references - "\x", where x is the number of the sequence - can be used to address those wildcard sequences in the "Replace" field. For example, find using wildcards ("(Robert) (Bram)") and replace with ("\2 \1") would result in "Bram Robert".
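
The same grouping and back-reference idea exists in most regular expression engines. As a small illustration outside Word, this sketch uses Python's re module (my choice of tool, not something the Word steps depend on) to repeat the name swap from the last bullet. The variable names are made up for the example.

```python
import re

# Each pair of round brackets captures a group; \1 and \2 in the
# replacement refer back to the first and second captured groups.
swapped = re.sub(r"(Robert) (Bram)", r"\2 \1", "Robert Bram")
print(swapped)

# A character range in square brackets matches any one character in it.
letters_only = re.findall(r"[A-Za-z]", "PS2 Hulk")
print(letters_only)
```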

Some pages I found useful when working this out.


Anonymous said...

You're awesome! thank you so much. This is exactly what I have been trying to figure out how to do--how to get rid of all the extra spaces and paragraph marks that occur when I copy and paste text onto a Word document. I've spent hours manually formatting this stuff and now you just showed me how to do it in an instant. You're me hero! thanks!

RobertMarkBram said...

Heh, glad it helped you! Thanks for the props. :)