Using Raku (formerly known as Perl_6)
[ Posting this because a number of U&L users have commented that some previous answers don't work with Unicode ].
Raku is a programming language in the Perl-family that features high-level support for Unicode. Raku normalizes all non-filename/non-filepath text to Normalization Form C (NFC) by default. Thus "graphemes, which are user-visible forms of the characters, will use a normalized representation" (i.e. normalized codepoints/width, see Unicode links at bottom for details).
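As a small illustration of what this normalization means in practice, here is a minimal sketch. The test string is made up for this example and is not taken from the sample files. An "e" followed by a combining acute accent is stored in composed form, so character counts reflect what the user sees rather than how the text happened to be encoded:

```raku
# "e" followed by U+0301 COMBINING ACUTE ACCENT, written as two codepoints
my $s = "e\x[301]";

say $s.chars;        # 1    -> graphemes (user-visible characters)
say $s.codes;        # 1    -> codepoints after NFC normalization
say $s.NFD.codes;    # 2    -> codepoints in the decomposed (NFD) form
say $s eq "\x[E9]";  # True -> same string as the precomposed U+00E9
```

This is why the wrapping code below can rely on .chars and .comb for column counts without any extra normalization step.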
Immediately below is an approach to solving the easier of the OP's requests (i.e. break text exactly at a desired column-width, irrespective of words/whitespace). The code is based on Raku's comb routine, and is written such that paragraphs (separated by \n\n or more) are kept separate, with a single blank line in between. (Thanks to @user2683246 for the example text):
1. Break text/words at a desired column-width:
Sample Input:
~$ cat shxp_X2.txt
O, they have lived long on the alms-basket of words, I marvel thy
master hath not eaten thee for a word; for thou art not so long by the
head as honorificabilitudinitatibus: thou art easier swallowed than a
flap-dragon.

O, they have lived long on the alms-basket of words, I marvel thy
master hath not eaten thee for a word; for thou art not so long by the
head as honorificabilitudinitatibus: thou art easier swallowed than a
flap-dragon.
Code with Sample Output (wrapped to <= 40 characters wide):
~$ raku -e 'my $wrap = 40; for slurp.split(/ \n**2..* /) { .subst(:global, / \n /, " ") andthen .put for $_.comb($wrap); put ""; };' shxp_X2.txt
O, they have lived long on the alms-bask
et of words, I marvel thy master hath no
t eaten thee for a word; for thou art no
t so long by the head as honorificabilit
udinitatibus: thou art easier swallowed
than a flap-dragon.

O, they have lived long on the alms-bask
et of words, I marvel thy master hath no
t eaten thee for a word; for thou art no
t so long by the head as honorificabilit
udinitatibus: thou art easier swallowed
than a flap-dragon.
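For use inside a script rather than on the command line, the same split/subst/comb idea can be sketched as a subroutine. The name wrap-hard and its signature are assumptions made for this example, and unlike the one-liner it trims surrounding whitespace and returns a string instead of printing:

```raku
# Hypothetical helper: name and signature are assumptions for this sketch.
sub wrap-hard(Str $text, Int $width --> Str) {
    # Paragraphs are separated by two or more newlines.
    my @paragraphs = $text.trim.split(/ \n**2..* /);
    my @wrapped = @paragraphs.map(-> $p {
        # Join the lines of one paragraph, then cut every $width characters.
        $p.subst(/ \n /, " ", :g).comb($width).join("\n")
    });
    # One blank line between paragraphs.
    return @wrapped.join("\n\n");
}

put wrap-hard("abcdefghij\nklm\n\n\nnopqrstuvwxyz\n", 8);
```

With a width of 8 this prints "abcdefgh" and "ij klm", a blank line, then "nopqrstu" and "vwxyz".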
2. Break between words (i.e. on whitespace) at desired column-width:
The code immediately below uses Raku's words routine, which breaks on whitespace. Below are over 30 example lines in a range of languages and Unicode scripts, wrapped to <= 72 characters wide:
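~$ raku -e 'my $wrap = 72; my $tmp = 0;
    # $tmp counts the characters already printed on the current output line
    for lines() { my $ln-ch = $_.chars;
        # an empty input line is a paragraph break: end the line, emit one blank line, reset
        if $ln-ch == 0 { "\n".say; $tmp = 0; next };
        for $_.words -> $w { my $w-ch = $w.chars;
            # print the word on the current line if it fits, else start a new line with it
            $wrap >= ($tmp + $w-ch)
                ?? ( "$w".print andthen $tmp += $w-ch )
                !! ( "\n$w".print andthen $tmp = $w-ch );
            # follow the word with one space while there is still room
            if ($wrap > $tmp) { " ".print andthen ++$tmp };
        }
    };' file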
Sample Input (from The Kermit Project):
English: The quick brown fox jumps over the lazy dog.
Jamaican: Chruu, a kwik di kwik brong fox a jomp huova di liezi daag de, yu no siit?
Irish: "An ḃfuil do ċroí ag bualaḋ ó ḟaitíos an ġrá a ṁeall lena ṗóg éada ó ṡlí do leasa ṫú?" "D'ḟuascail Íosa Úrṁac na hÓiġe Beannaiṫe pór Éava agus Áḋaiṁ."
Dutch: Pa's wijze lynx bezag vroom het fikse aquaduct.
German: Falsches Üben von Xylophonmusik quält jeden größeren Zwerg. (1)
German: Im finſteren Jagdſchloß am offenen Felsquellwaſſer patzte der affig-flatterhafte kauzig-höfliche Bäcker über ſeinem verſifften kniffligen C-Xylophon. (2)
Norwegian: Blåbærsyltetøy ("blueberry jam", includes every extra letter used in Norwegian).
Swedish: Flygande bäckasiner söka strax hwila på mjuka tuvor.
Icelandic: Sævör grét áðan því úlpan var ónýt.
Finnish: (5) Törkylempijävongahdus (This is a perfect pangram, every letter appears only once. Translating it is an art on its own, but I'll say "rude lover's yelp". :-D)
Finnish: (5) Albert osti fagotin ja töräytti puhkuvan melodian. (Albert bought a bassoon and hooted an impressive melody.)
Finnish: (5) On sangen hauskaa, että polkupyörä on maanteiden jokapäiväinen ilmiö. (It's pleasantly amusing, that the bicycle is an everyday sight on the roads.)
Polish: Pchnąć w tę łódź jeża lub osiem skrzyń fig.
Czech: Příliš žluťoučký kůň úpěl ďábelské ódy.
Slovak: Starý kôň na hŕbe kníh žuje tíško povädnuté ruže, na stĺpe sa ďateľ učí kvákať novú ódu o živote.
Slovenian: Šerif bo za domačo vajo spet kuhal žgance.
Greek (monotonic): ξεσκεπάζω την ψυχοφθόρα βδελυγμία
Greek (polytonic): ξεσκεπάζω τὴν ψυχοφθόρα βδελυγμία
Russian: Съешь же ещё этих мягких французских булок да выпей чаю.
Russian: В чащах юга жил-был цитрус? Да, но фальшивый экземпляр! ёъ.
Bulgarian: Жълтата дюля беше щастлива, че пухът, който цъфна, замръзна като гьон.
Sami (Northern): Vuol Ruoŧa geđggiid leat máŋga luosa ja čuovžža.
Hungarian: Árvíztűrő tükörfúrógép.
Spanish: El pingüino Wenceslao hizo kilómetros bajo exhaustiva lluvia y frío, añoraba a su querido cachorro.
Spanish: Volé cigüeña que jamás cruzó París, exhibe flor de kiwi y atún.
Portuguese: O próximo vôo à noite sobre o Atlântico, põe freqüentemente o único médico. (3)
French: Les naïfs ægithales hâtifs pondant à Noël où il gèle sont sûrs d'être déçus en voyant leurs drôles d'œufs abîmés.
Esperanto: Eĥoŝanĝo ĉiuĵaŭde
Esperanto: Laŭ Ludoviko Zamenhof bongustas freŝa ĉeĥa manĝaĵo kun spicoj.
Hebrew: זה כיף סתם לשמוע איך תנצח קרפד עץ טוב בגן.
Japanese (Hiragana):
いろはにほへど ちりぬるを
わがよたれぞ つねならむ
うゐのおくやま けふこえて
あさきゆめみじ ゑひもせず (4)
Japanese (Kanji):
色は匂へど 散りぬるを
我が世誰ぞ 常ならむ
有為の奥山 今日越えて
浅き夢見じ 酔ひもせず
Sample Output (wrapped to 72 characters):
English: The quick brown fox jumps over the lazy dog. Jamaican: Chruu, a
kwik di kwik brong fox a jomp huova di liezi daag de, yu no siit? Irish:
"An ḃfuil do ċroí ag bualaḋ ó ḟaitíos an ġrá a ṁeall lena ṗóg éada ó ṡlí
do leasa ṫú?" "D'ḟuascail Íosa Úrṁac na hÓiġe Beannaiṫe pór Éava agus
Áḋaiṁ." Dutch: Pa's wijze lynx bezag vroom het fikse aquaduct. German:
Falsches Üben von Xylophonmusik quält jeden größeren Zwerg. (1) German:
Im finſteren Jagdſchloß am offenen Felsquellwaſſer patzte der
affig-flatterhafte kauzig-höfliche Bäcker über ſeinem verſifften
kniffligen C-Xylophon. (2) Norwegian: Blåbærsyltetøy ("blueberry jam",
includes every extra letter used in Norwegian). Swedish: Flygande
bäckasiner söka strax hwila på mjuka tuvor. Icelandic: Sævör grét áðan
því úlpan var ónýt. Finnish: (5) Törkylempijävongahdus (This is a
perfect pangram, every letter appears only once. Translating it is an
art on its own, but I'll say "rude lover's yelp". :-D) Finnish: (5)
Albert osti fagotin ja töräytti puhkuvan melodian. (Albert bought a
bassoon and hooted an impressive melody.) Finnish: (5) On sangen
hauskaa, että polkupyörä on maanteiden jokapäiväinen ilmiö. (It's
pleasantly amusing, that the bicycle is an everyday sight on the roads.)
Polish: Pchnąć w tę łódź jeża lub osiem skrzyń fig. Czech: Příliš
žluťoučký kůň úpěl ďábelské ódy. Slovak: Starý kôň na hŕbe kníh žuje
tíško povädnuté ruže, na stĺpe sa ďateľ učí kvákať novú ódu o živote.
Slovenian: Šerif bo za domačo vajo spet kuhal žgance. Greek (monotonic):
ξεσκεπάζω την ψυχοφθόρα βδελυγμία Greek (polytonic): ξεσκεπάζω τὴν
ψυχοφθόρα βδελυγμία Russian: Съешь же ещё этих мягких французских булок
да выпей чаю. Russian: В чащах юга жил-был цитрус? Да, но фальшивый
экземпляр! ёъ. Bulgarian: Жълтата дюля беше щастлива, че пухът, който
цъфна, замръзна като гьон. Sami (Northern): Vuol Ruoŧa geđggiid leat
máŋga luosa ja čuovžža. Hungarian: Árvíztűrő tükörfúrógép. Spanish: El
pingüino Wenceslao hizo kilómetros bajo exhaustiva lluvia y frío,
añoraba a su querido cachorro. Spanish: Volé cigüeña que jamás cruzó
París, exhibe flor de kiwi y atún. Portuguese: O próximo vôo à noite
sobre o Atlântico, põe freqüentemente o único médico. (3) French: Les
naïfs ægithales hâtifs pondant à Noël où il gèle sont sûrs d'être déçus
en voyant leurs drôles d'œufs abîmés. Esperanto: Eĥoŝanĝo ĉiuĵaŭde
Esperanto: Laŭ Ludoviko Zamenhof bongustas freŝa ĉeĥa manĝaĵo kun
spicoj. Hebrew: זה כיף סתם לשמוע איך תנצח קרפד עץ טוב בגן. Japanese
(Hiragana): いろはにほへど ちりぬるを わがよたれぞ つねならむ うゐのおくやま けふこえて あさきゆめみじ ゑひもせず (4)
Japanese (Kanji): 色は匂へど 散りぬるを 我が世誰ぞ 常ならむ 有為の奥山 今日越えて 浅き夢見じ 酔ひもせず
Paragraphs (separated by \n\n or more) are kept separate, with a single blank line in between. All lines in the Sample Output wrap to 72 characters or fewer. The only visual problem is with Japanese Hiragana/Kanji: the last two lines of the "wrapped" output look wider because these characters are typically displayed double-width, but in fact they contain 71 and 65 characters, respectively.
Custom words can be defined, based upon Unicode properties. For example, the .words routine can be replaced by .comb(/ <-:Zs>+ /) to split on Unicode 'Space-Separator' as defined in Unicode® Standard Annex #44.
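A quick way to see the difference between the two is sketched below. The test strings are invented for this example. A tab is whitespace but is not in the Zs category, so .words splits on it while the custom comb does not:

```raku
# A tab between "one" and "two", an ordinary space before "three"
my $line = "one\ttwo three";

my @by-whitespace = $line.words;               # splits on the tab and the space
my @by-zs         = $line.comb(/ <-:Zs>+ /);   # splits only on Zs characters

say @by-whitespace.elems;   # 3
say @by-zs.elems;           # 2  ("one\ttwo" stays together)

# U+3000 IDEOGRAPHIC SPACE is in category Zs, so the custom comb splits on it
say "a\x[3000]b".comb(/ <-:Zs>+ /).elems;   # 2
```

Other properties can be substituted for Zs in the same pattern, depending on what should count as a word boundary.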
Right now the code doesn't hyphenate or otherwise break individual words that are longer than the desired $wrap column width. (This may be the desired behavior; otherwise you might indeed see issues with excessively long words and/or short column-widths).
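If breaking over-long words is wanted, one possible approach is to run each word through comb first, so that no piece is longer than the column width. The sketch below is not the code used above: the name wrap-words, its signature and the cut-without-hyphen policy are all assumptions, and it ignores paragraph breaks for brevity:

```raku
# Hypothetical helper: name, signature and splitting policy are assumptions.
sub wrap-words(Str $text, Int $width --> Str) {
    my @lines;
    my $current = "";
    # comb($width) leaves short words whole and cuts long ones into pieces
    for $text.words.map({ .comb($width).Slip }) -> $w {
        if $current eq "" {
            $current = $w;
        }
        elsif $current.chars + 1 + $w.chars <= $width {
            $current ~= " $w";          # the word fits after a single space
        }
        else {
            @lines.push($current);      # line is full, start a new one
            $current = $w;
        }
    }
    @lines.push($current) if $current ne "";
    return @lines.join("\n");
}

put wrap-words("a honorificabilitudinitatibus b", 10);
```

With a width of 10 this prints four lines: "a", "honorifica", "bilitudini" and "tatibus b". Because lines are built up before being joined, no trailing whitespace is left behind.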
A single trailing whitespace is left at the end of lines shorter than $wrap. This can be corrected by running ~$ raku -ne '.trim-trailing.put;' over the wrapped output.
https://unicode.org/reports/tr15/#Canon_Compat_Equivalence
https://docs.raku.org/language/unicode
https://docs.raku.org/type/Str#routine_words
https://docs.raku.org/type/Str#routine_comb
https://raku.org
fold -s -w 80 email.txt | sed 's/^.*$/> &/' – Marcello Romani Feb 10 '15 at 21:10
fold that lets you specify a string to wrap on? – will Feb 01 '17 at 02:30
fold breaks urls, while fmt does not. – Skippy le Grand Gourou Mar 28 '17 at 11:05
fold -w 5 -s <<< 123456 – mwfearnley Mar 12 '21 at 12:25
fmt doesn't leave trailing whitespace. If you need to use fold instead of fmt, you can add a bit of Perl at the end to strip out trailing whitespace. fold -s -w 80 file.txt | perl -pe 's/ +$//' – Jonathan Jul 05 '22 at 12:28
fold -w80 -s fails on unicode text. better: pandoc input.txt -t plain --wrap=auto --columns=80. but pandoc modifies the text: strips xml tags, replaces ascii quotes with unicode quotes, ... see pandoc issue: add input format plain – milahu Oct 05 '23 at 08:54
column). I want to have multilined columns.. So almost using fold to set the width of columns with multiple lines within each of the column spaces – ikwyl6 Oct 25 '23 at 23:14