215

I know that I can use something like cat test.txt | pr -w 80 to wrap lines to 80 characters wide, but that puts a lot of space at the top and bottom of the printed lines, and it does not work correctly on some systems.

What's the best way to force a text file with long lines to be wrapped at a certain width?

Bonus points if you can keep it from breaking words.

GKFX
cwd

7 Answers

265

You are looking for

fold -w 80 -s text.txt
  • -w sets the width of the text; 80 is the default.
  • -s breaks lines at spaces rather than in the middle of words.

This is the standard way, but some other systems need "-c" instead of "-w".
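As a minimal sketch of what -s changes, the following pipes a made-up sample sentence (the text variable is hypothetical, not from the question) through fold at a narrow width of 10 so the breaks are easy to see. Because fold -s keeps the space it breaks after, a sed step that strips trailing blanks is included as an optional cleanup; that step is an assumption about what is wanted, not something fold does itself.

```shell
#!/bin/sh
# Hypothetical sample text; any long line will do.
text='The quick brown fox jumps over the lazy dog'

# Without -s: break exactly at column 10, even in the middle of a word.
hard=$(printf '%s\n' "$text" | fold -w 10)

# With -s: break after the last space that still fits in 10 columns.
soft=$(printf '%s\n' "$text" | fold -s -w 10)

# Optional cleanup: fold -s leaves the space at the end of each line.
clean=$(printf '%s\n' "$soft" | sed 's/[[:space:]]*$//')

printf '%s\n\n%s\n\n%s\n' "$hard" "$soft" "$clean"
```

Neither form changes the text itself; fold only inserts newlines, so joining the output lines gives back the original line.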

  • Works on OS X, too, but filename needs to be after args. Thanks! – rdrey Sep 02 '14 at 22:13
  • 3
    On a side note, to nicely format e-mails for text-only reply, I use: fold -s -w 80 email.txt | sed 's/^.*$/> &/' – Marcello Romani Feb 10 '15 at 21:10
  • 3
    @MarcelloRomani, shouldn't you use a width of 78 since you're prepending two characters? – nanny Feb 26 '15 at 14:59
  • 1
    Hmm... I guess so. Thanks for pointing that out :) – Marcello Romani Feb 27 '15 at 12:35
  • 3
    Is there something like fold that lets you specify a string to wrap on? – will Feb 01 '17 at 02:30
  • 5
    Note that fold breaks urls, while fmt does not. – Skippy le Grand Gourou Mar 28 '17 at 11:05
  • 1
    Any idea what happens when the folded file is Markdown and contains URLs that are longer than the specified width? – Richard Dec 13 '19 at 12:32
  • 2
    Works nicely in mingw, but it leaves trailing whitespace at the end of some lines. Is there a neat way of fixing this? – Max Barraclough Feb 09 '20 at 16:32
  • 1
    @Richard if you're asking what it does with really long words, it would seem to force breaks on non-spaces where necessary. Try fold -w 5 -s <<< 123456 – mwfearnley Mar 12 '21 at 12:25
  • 1
    @MaxBarraclough fmt doesn't leave trailing whitespace. If you need to use fold instead of fmt, you can add a bit of Perl at the end to strip out trailing whitespace. fold -s -w 80 file.txt | perl -pe 's/ +$//' – Jonathan Jul 05 '22 at 12:28
  • Limitation: fold -w80 -s fails on Unicode text. Better: pandoc input.txt -t plain --wrap=auto --columns=80. But pandoc modifies the text: it strips XML tags, replaces ASCII quotes with Unicode quotes, ... See the pandoc issue "add input format plain". – milahu Oct 05 '23 at 08:54
  • Does anyone know a way to use fold with a delimiter so that I can create columns (without using column)? I want multi-line columns, i.e. to use fold to set the width of each column, with multiple lines within each column. – ikwyl6 Oct 25 '23 at 23:14
  • @milahu if you've found a Unicode solution, could you post a full answer? I've put up a bare-bones, Unicode-aware answer written in Raku (a.k.a. Perl6), and would like to compare output (See: https://unix.stackexchange.com/a/766277/227738 ). Thx! – jubilatious1 Jan 08 '24 at 07:04
80

In addition to fold, take a look at fmt. fmt tries to choose line breaks intelligently to make text look good. It doesn't break long words; it wraps only at spaces. It will also join adjacent lines, which is good for prose but bad for log files or other formatted text.
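The joining behaviour is easy to see on a small example. The sketch below uses made-up input (two short prose lines and two log-style lines, both hypothetical) and assumes an fmt that accepts -w, as GNU and BSD fmt do.

```shell
#!/bin/sh
# Hypothetical prose: two short lines that belong to one paragraph.
prose=$(printf 'one two\nthree four\n' | fmt -w 40)

# Hypothetical log lines: each record should stay on its own line.
log_fmt=$(printf 'INFO start\nWARN disk\n' | fmt -w 40)
log_fold=$(printf 'INFO start\nWARN disk\n' | fold -s -w 40)

printf 'fmt on prose:\n%s\n\n' "$prose"
printf 'fmt on log lines:\n%s\n\n' "$log_fmt"
printf 'fold on log lines:\n%s\n' "$log_fold"
```

fmt merges both inputs into a single line, which is what you want for the prose but not for the log records; fold leaves the existing line boundaries alone.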

Jonathan
52
$ cat shxp.txt

O, they have lived long on the alms-basket of words, I marvel thy
master hath not eaten thee for a word; for thou art not so long by the
head as honorificabilitudinitatibus: thou art easier swallowed than a
flap-dragon.

1) Assured fixed line width with word breaking:

fold -w 20 <shxp.txt

O, they have lived l
ong on the alms-bask
et of words, I marve
l thy master hath no
t eaten thee for a w
ord; for thou art no
t so long by the hea
d as honorificabilit
udinitatibus: thou a
rt easier swallowed
than a flap-dragon.

2) Assured fixed line width with word breaking only in exceptional cases. A word gets broken only if it is too long to fit on a line:

fold -sw 20 <shxp.txt

O, they have lived
long on the
alms-basket of
words, I marvel thy
master hath not
eaten thee for a
word; for thou art
not so long by the
head as
honorificabilitudini
tatibus: thou art
easier swallowed
than a flap-dragon.

3) Best-effort line width without any word breaking. If a word is too long to fit on a line, it is left as is, so some lines may end up longer than requested:

fmt -w 20 <shxp.txt

O, they have lived
long on the
alms-basket of
words, I marvel
thy master hath
not eaten thee
for a word; for
thou art not so
long by the head as
honorificabilitudinitatibus:
thou art easier
swallowed than a
flap-dragon.

Note that fmt also tries to balance ragged paragraph lines, unlike fold -s.
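To make the difference in long-word handling concrete, here is a small sketch on a made-up one-line input (hypothetical, chosen to contain one word longer than the width). The longest helper is also an illustration only; it just reports the length of the longest output line for each tool.

```shell
#!/bin/sh
# Hypothetical input with one word (27 characters) longer than the width of 20.
line='a honorificabilitudinitatibus b'

# Print the length of the longest line read from standard input.
longest() { awk 'length($0) > max { max = length($0) } END { print max + 0 }'; }

fold_max=$(printf '%s\n' "$line" | fold -s -w 20 | longest)
fmt_max=$(printf '%s\n' "$line" | fmt -w 20 | longest)

printf 'fold -s longest line: %s\n' "$fold_max"
printf 'fmt longest line:     %s\n' "$fmt_max"
```

fold -s never exceeds the requested width because it splits the long word, while fmt keeps the word whole and so overshoots the width on that line.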

4) Perhaps the most typographically sophisticated solution, thanks to the roff markup language and the formatting utility that the man program uses under the hood. It offers great possibilities for additional customization:

2>/dev/null nroff <(echo .pl 1 ; echo .ll 20) shxp.txt

O, they have lived
long on the alms‐
basket of words, I
marvel thy master
hath not eaten thee
for a word; for thou
art not so long by
the head as honori‐
ficabilitudinitati‐
bus: thou art easier
swallowed than a
flap‐dragon.

.pl 1 roff markup sets the page height to a single line, effectively disabling pagination.

.ll 20 sets the line length to 20 characters.

Putting the markup in a separate file will simplify the command:

$ cat markup.roff
.pl 1
.ll 20
$ 2>/dev/null nroff markup.roff shxp.txt

In order for nroff to work with Unicode, the text can be pre-converted using preconv:

$ 2>/dev/null nroff markup.roff <(preconv shxp.txt)
  • 3
    Underrated answer. Available on most systems. Nice one. – Merc Oct 04 '16 at 02:09
  • 2
    I really appreciate seeing a real text example with different options. I have been trying to write a Python version of wrap, but was unsatisfied with the long-word handling. Having the wrap and the fmt option for longer-than-specified words is very nice. – bballdave025 Sep 04 '20 at 18:31
  • nroff looks nice in your example, but turns å into Ã¥ and mangles ansi colouring. Can the nroff thing be made to deal with unicode etc. or is this a hack where the input actually should be formatted like a man page source file? – unhammer Jan 03 '24 at 14:46
  • 1
    @unhammer Using preconv of the text makes nroff unicode enabled. – user2683246 Mar 22 '24 at 20:39
  • Nice, thanks @user2683246 . preconv+nroff seems like the winner for human-readable word wrapping with standard unix tools :) – unhammer Mar 24 '24 at 21:23
15

Another (lesser-known) tool that does what you want is wrap from GNU Talkfilters:

wrap -w 80 < textfile

Also (off topic):

but that puts a lot of space on the top and bottom of the printed lines

Add -t when invoking pr to omit headers and trailers:

   -t, --omit-header
          omit page headers and trailers
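A quick way to see the effect of -t is to count output lines. This sketch feeds a made-up two-line input (hypothetical) to pr with and without the option; the count without -t depends on the page length, which is 66 lines by default.

```shell
#!/bin/sh
# Hypothetical two-line input, paginated with and without headers/trailers.
with_header=$(printf 'a\nb\n' | pr | wc -l)
without_header=$(printf 'a\nb\n' | pr -t | wc -l)

printf 'pr:    %s lines\n' "$with_header"
printf 'pr -t: %s lines\n' "$without_header"
```

With -t, pr stops after the last input line instead of padding to the end of the page, so the extra space at the top and bottom disappears.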
don_crissti
11

And for more formatting options, look at par -- http://www.nicemice.net/par/

sendmoreinfo
  • 5
    Currently the web site is down. There is the Internet Archive and Google's cache, but this still shows why it's important to post more than just links; you could at least have posted the examples from the official documentation. – phk Dec 27 '16 at 16:31
1

Using Raku (formerly known as Perl_6)

[ Posting this because a number of U&L users have commented that some previous answers don't work with Unicode ].

Raku is a programming language in the Perl-family that features high-level support for Unicode. Raku normalizes all non-filename/non-filepath text to Normalization Form C (NFC) by default. Thus "graphemes, which are user-visible forms of the characters, will use a normalized representation" (i.e. normalized codepoints/width, see Unicode links at bottom for details).

Immediately below is an approach to solving the easier of the OP's requests (i.e. break text exactly at a desired column-width, irrespective of words/whitespace). The code is based on Raku's comb routine, and is written such that paragraphs (\n\n-separated or greater) are kept separate with a single blank line in between. (Thanks to @user2683246 for the example text):

1. Break text/words at a desired column-width:

Sample Input:

~$ cat shxp_X2.txt
O, they have lived long on the alms-basket of words, I marvel thy
master hath not eaten thee for a word; for thou art not so long by the
head as honorificabilitudinitatibus: thou art easier swallowed than a
flap-dragon.

O, they have lived long on the alms-basket of words, I marvel thy master hath not eaten thee for a word; for thou art not so long by the head as honorificabilitudinitatibus: thou art easier swallowed than a flap-dragon.

Code with Sample Output (wrapped to <= 40 characters wide):

~$ raku -e 'my $wrap = 40; for slurp.split(/ \n**2..* /) { .subst(:global, / \n /, " ") andthen .put for $_.comb($wrap); put ""; };'   shxp_X2.txt
O, they have lived long on the alms-bask
et of words, I marvel thy master hath no
t eaten thee for a word; for thou art no
t so long by the head as honorificabilit
udinitatibus: thou art easier swallowed 
than a flap-dragon.

O, they have lived long on the alms-bask
et of words, I marvel thy master hath no
t eaten thee for a word; for thou art no
t so long by the head as honorificabilit
udinitatibus: thou art easier swallowed 
than a flap-dragon.



2. Break between words (i.e. on whitespace) at desired column-width:

The code immediately below uses Raku's words routine, which breaks on whitespace. Below are more than 30 example lines in a variety of languages and Unicode scripts, wrapped to <= 72 characters wide:

~$ raku -e 'my  $wrap = 72; my   $tmp = 0; 
            for lines() {   my $ln-ch = $_.chars;  
                if  $ln-ch == 0 { "\n".say; $tmp = 0; next };    
                for $_.words -> $w {   my  $w-ch = $w.chars;  
                    $wrap >=  ($tmp + $w-ch)        
                    ?? (   "$w".print andthen $tmp += $w-ch )  
                    !! ( "\n$w".print andthen $tmp  = $w-ch );  
                    if ($wrap > $tmp) { " ".print andthen ++$tmp };  
                }   
            };'   file

Sample Input (from The Kermit Project):

English: The quick brown fox jumps over the lazy dog.
Jamaican: Chruu, a kwik di kwik brong fox a jomp huova di liezi daag de, yu no siit?
Irish: "An ḃfuil do ċroí ag bualaḋ ó ḟaitíos an ġrá a ṁeall lena ṗóg éada ó ṡlí do leasa ṫú?" "D'ḟuascail Íosa Úrṁac na hÓiġe Beannaiṫe pór Éava agus Áḋaiṁ."
Dutch: Pa's wijze lynx bezag vroom het fikse aquaduct.
German: Falsches Üben von Xylophonmusik quält jeden größeren Zwerg. (1)
German: Im finſteren Jagdſchloß am offenen Felsquellwaſſer patzte der affig-flatterhafte kauzig-höf‌liche Bäcker über ſeinem verſifften kniffligen C-Xylophon. (2)
Norwegian: Blåbærsyltetøy ("blueberry jam", includes every extra letter used in Norwegian).
Swedish: Flygande bäckasiner söka strax hwila på mjuka tuvor.
Icelandic: Sævör grét áðan því úlpan var ónýt.
Finnish: (5) Törkylempijävongahdus (This is a perfect pangram, every letter appears only once. Translating it is an art on its own, but I'll say "rude lover's yelp". :-D)
Finnish: (5) Albert osti fagotin ja töräytti puhkuvan melodian. (Albert bought a bassoon and hooted an impressive melody.)
Finnish: (5) On sangen hauskaa, että polkupyörä on maanteiden jokapäiväinen ilmiö. (It's pleasantly amusing, that the bicycle is an everyday sight on the roads.)
Polish: Pchnąć w tę łódź jeża lub osiem skrzyń fig.
Czech: Příliš žluťoučký kůň úpěl ďábelské ódy.
Slovak: Starý kôň na hŕbe kníh žuje tíško povädnuté ruže, na stĺpe sa ďateľ učí kvákať novú ódu o živote.
Slovenian: Šerif bo za domačo vajo spet kuhal žgance.
Greek (monotonic): ξεσκεπάζω την ψυχοφθόρα βδελυγμία
Greek (polytonic): ξεσκεπάζω τὴν ψυχοφθόρα βδελυγμία
Russian: Съешь же ещё этих мягких французских булок да выпей чаю.
Russian: В чащах юга жил-был цитрус? Да, но фальшивый экземпляр! ёъ.
Bulgarian: Жълтата дюля беше щастлива, че пухът, който цъфна, замръзна като гьон.
Sami (Northern): Vuol Ruoŧa geđggiid leat máŋga luosa ja čuovžža.
Hungarian: Árvíztűrő tükörfúrógép.
Spanish: El pingüino Wenceslao hizo kilómetros bajo exhaustiva lluvia y frío, añoraba a su querido cachorro.
Spanish: Volé cigüeña que jamás cruzó París, exhibe flor de kiwi y atún.
Portuguese: O próximo vôo à noite sobre o Atlântico, põe freqüentemente o único médico. (3)
French: Les naïfs ægithales hâtifs pondant à Noël où il gèle sont sûrs d'être déçus en voyant leurs drôles d'œufs abîmés.
Esperanto: Eĥoŝanĝo ĉiuĵaŭde
Esperanto: Laŭ Ludoviko Zamenhof bongustas freŝa ĉeĥa manĝaĵo kun spicoj.
Hebrew: זה כיף סתם לשמוע איך תנצח קרפד עץ טוב בגן.
Japanese (Hiragana):
いろはにほへど ちりぬるを
わがよたれぞ つねならむ
うゐのおくやま けふこえて
あさきゆめみじ ゑひもせず (4)
Japanese (Kanji):
色は匂へど 散りぬるを
我が世誰ぞ 常ならむ
有為の奥山 今日越えて
浅き夢見じ 酔ひもせず

Sample Output (wrapped to 72 characters):

English: The quick brown fox jumps over the lazy dog. Jamaican: Chruu, a
kwik di kwik brong fox a jomp huova di liezi daag de, yu no siit? Irish:
"An ḃfuil do ċroí ag bualaḋ ó ḟaitíos an ġrá a ṁeall lena ṗóg éada ó ṡlí
do leasa ṫú?" "D'ḟuascail Íosa Úrṁac na hÓiġe Beannaiṫe pór Éava agus
Áḋaiṁ." Dutch: Pa's wijze lynx bezag vroom het fikse aquaduct. German:
Falsches Üben von Xylophonmusik quält jeden größeren Zwerg. (1) German:
Im finſteren Jagdſchloß am offenen Felsquellwaſſer patzte der
affig-flatterhafte kauzig-höf‌liche Bäcker über ſeinem verſifften
kniffligen C-Xylophon. (2) Norwegian: Blåbærsyltetøy ("blueberry jam",
includes every extra letter used in Norwegian). Swedish: Flygande
bäckasiner söka strax hwila på mjuka tuvor. Icelandic: Sævör grét áðan
því úlpan var ónýt. Finnish: (5) Törkylempijävongahdus (This is a
perfect pangram, every letter appears only once. Translating it is an
art on its own, but I'll say "rude lover's yelp". :-D) Finnish: (5)
Albert osti fagotin ja töräytti puhkuvan melodian. (Albert bought a
bassoon and hooted an impressive melody.) Finnish: (5) On sangen
hauskaa, että polkupyörä on maanteiden jokapäiväinen ilmiö. (It's
pleasantly amusing, that the bicycle is an everyday sight on the roads.)
Polish: Pchnąć w tę łódź jeża lub osiem skrzyń fig. Czech: Příliš
žluťoučký kůň úpěl ďábelské ódy. Slovak: Starý kôň na hŕbe kníh žuje
tíško povädnuté ruže, na stĺpe sa ďateľ učí kvákať novú ódu o živote.
Slovenian: Šerif bo za domačo vajo spet kuhal žgance. Greek (monotonic):
ξεσκεπάζω την ψυχοφθόρα βδελυγμία Greek (polytonic): ξεσκεπάζω τὴν
ψυχοφθόρα βδελυγμία Russian: Съешь же ещё этих мягких французских булок
да выпей чаю. Russian: В чащах юга жил-был цитрус? Да, но фальшивый
экземпляр! ёъ. Bulgarian: Жълтата дюля беше щастлива, че пухът, който
цъфна, замръзна като гьон. Sami (Northern): Vuol Ruoŧa geđggiid leat
máŋga luosa ja čuovžža. Hungarian: Árvíztűrő tükörfúrógép. Spanish: El
pingüino Wenceslao hizo kilómetros bajo exhaustiva lluvia y frío,
añoraba a su querido cachorro. Spanish: Volé cigüeña que jamás cruzó
París, exhibe flor de kiwi y atún. Portuguese: O próximo vôo à noite
sobre o Atlântico, põe freqüentemente o único médico. (3) French: Les
naïfs ægithales hâtifs pondant à Noël où il gèle sont sûrs d'être déçus
en voyant leurs drôles d'œufs abîmés. Esperanto: Eĥoŝanĝo ĉiuĵaŭde
Esperanto: Laŭ Ludoviko Zamenhof bongustas freŝa ĉeĥa manĝaĵo kun
spicoj. Hebrew: זה כיף סתם לשמוע איך תנצח קרפד עץ טוב בגן. Japanese
(Hiragana): いろはにほへど ちりぬるを わがよたれぞ つねならむ うゐのおくやま けふこえて あさきゆめみじ ゑひもせず (4)
Japanese (Kanji): 色は匂へど 散りぬるを 我が世誰ぞ 常ならむ 有為の奥山 今日越えて 浅き夢見じ 酔ひもせず
  • Paragraphs (\n\n-separated or greater) are maintained separate with a single blank line in between. All lines in the Sample Output wrap to 72 characters or less. The only visual problem is with Japanese Hiragana/Kanji, but in fact the last two lines of the "wrapped" output contain 71 and 65 characters, respectively.

  • Custom words can be defined, based upon Unicode properties. For example, the .words routine can be replaced by .comb(/ <-:Zs>+ /) to split on Unicode 'Space-Separator' as defined in Unicode® Standard Annex #44.

  • Right now the code doesn't hyphenate or otherwise break individual words that are longer than the desired $wrap column width. (This may be the desired behavior, otherwise you indeed might see issues with excessively long words and/or short column-widths).

  • A single trailing whitespace character is left at the end of lines shorter than $wrap. This can be corrected by running ~$ raku -ne '.trim-trailing.put;' over the wrapped output.


https://unicode.org/reports/tr15/#Canon_Compat_Equivalence
https://docs.raku.org/language/unicode
https://docs.raku.org/type/Str#routine_words
https://docs.raku.org/type/Str#routine_comb
https://raku.org

jubilatious1
  • instead of testing different human languages, it would make more sense to test different unicode whitespace characters – milahu Jan 08 '24 at 08:42
  • @milahu Edited, thanks. Raku's .words routine is basically the same as (Unicode-aware) $input.comb(/ \S+ /, $limit) where \S+ is one-or-more non-whitespace characters and $limit equals Inf. So Raku .combs on the Unicode definition of whitespace (.comb is essentially the inverse of .split). If a user needs to create their own .words definition then they can use Unicode properties to .comb on a delimiter of their choice. Cheers. – jubilatious1 Jan 08 '24 at 10:20
1

pandoc can wrap Unicode text

pandoc -f plain.lua -t plain \
  --wrap=auto --columns=78 input.txt

You only need a plain-text reader in plain.lua,
because by default pandoc cannot parse plain text:

-- A sample custom reader that just parses text into blankline-separated
-- paragraphs with space-separated words.

-- For better performance we put these functions in local variables:
local P, S, R, Cf, Cc, Ct, V, Cs, Cg, Cb, B, C, Cmt =
  lpeg.P, lpeg.S, lpeg.R, lpeg.Cf, lpeg.Cc, lpeg.Ct, lpeg.V,
  lpeg.Cs, lpeg.Cg, lpeg.Cb, lpeg.B, lpeg.C, lpeg.Cmt

local whitespacechar = S(" \t\r\n")
local wordchar = (1 - whitespacechar)
local spacechar = S(" \t")
local newline = P"\r"^-1 * P"\n"
local blanklines = newline * (spacechar^0 * newline)^1
local endline = newline - blanklines

-- Grammar
G = P{ "Pandoc",
  Pandoc = Ct(V"Block"^0) / pandoc.Pandoc;
  Block = blanklines^0 * V"Para" ;
  Para = Ct(V"Inline"^1) / pandoc.Para;
  Inline = V"Str" + V"Space" + V"SoftBreak" ;
  Str = wordchar^1 / pandoc.Str;
  Space = spacechar^1 / pandoc.Space;
  SoftBreak = endline / pandoc.SoftBreak;
}

function Reader(input)
  return lpeg.match(G, tostring(input))
end

milahu