
I am converting a large list of domains to their IDN (Punycode) form with the following command:

cat list | idn > clean

list format example:

президент.рф
mañana.com
bücher.com
café.fr
cliché.com
hualañe.cl
köln-düsseldorfer-rhein-main.de
mūsųlaikas.lt
sendesık.com
sushicorner-würzburg.de
domain.com
# almost 1 M lines

But I get the following message:

idn: idna_to_ascii_4z (big list): Output would be too big or too small

So I need to make sure that no entry in my list falls outside the allowed length limits (too big or too small).

I found this:

According to RFC 1035, the length of an FQDN is limited to 255 characters, and each label (a node delimited by dots in the hostname) is limited to 63 characters

and

a lower limit of 1 character per label (example: t.co)

Question: How do I remove from my list, on the command line, the domains that have a label longer than 63 characters or shorter than 1 character, so that idn runs without error in bash?

Actions: I have tried the following, although I would prefer a single command (partial source):

sed -n '/.\{63\}/p' list > out
grep -vi -f <(sed 's:^\(.*\)$:\\\1\$:' out) list | sort -u > out2

But when I run the idn command, the same message comes up:

cat out2 | idn
idn: idna_to_ascii_4z (big list): Output would be too big or too small

I appreciate any help

PS: Maybe the problem is related to idn and the size of the list (which is very large); I do not know. I have no information on whether idn limits the number of lines, domains, or hostnames it can process. The help text does not say much on this point.

Update: The problem was solved, but the correct answer was deleted by its author, @cas, apparently due to a spam incident. I am voting to close.

acgbox
    Check for double dots, e.g. hello..com. You may do that with grep -F '..' list. – Kusalananda Aug 27 '19 at 15:47
  • @Kusalananda the list is already clean of ".." (..com, etc). I use a TLDs validation system before processing it. I also use a system to remove overlapping domains (subdomain.domain.com and domain.com = only domain.com) – acgbox Aug 27 '19 at 16:00
  • 1
    I'm voting to close this question as off-topic because U&L shouldn't be helping scammers or spammers. – cas Aug 28 '19 at 17:31
  • cas' comment is in response to a pastebin link provided by ajcg showing various domain names. @ajcg, I'd encourage you to provide representative input data for your question. – Jeff Schaller Aug 28 '19 at 18:13
  • @cas sorry about this spam incident. The problem is already solved thanks to you. – acgbox Aug 28 '19 at 18:22
  • ajcg, it sounds like cas is concerned about the contents of the pastebin link, which is why they deleted their answer. – Jeff Schaller Aug 28 '19 at 18:36
  • @JeffSchaller Thank you for your concern. If 'cas' had told me, I would have deleted the link. 'cas' would need to post his answer again for me to accept it. He gave me a lot of additional information that helped me solve other problems with my list that are not described in the question, which is why I think the answer deserves to be accepted. – acgbox Aug 28 '19 at 18:57
  • 1
    ... which is why I encouraged you to post representative data so that Answerers could test their solution against it. – Jeff Schaller Aug 28 '19 at 20:15

1 Answer


I don't think that idn has any switches to skip unacceptable strings instead of exiting with an error, so the only option left is to restart it after an expected error:

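idn_skip(){
    # Run idn repeatedly on the same stdin: its normal output goes to fd 3
    # (the caller's stdout), its error message is captured in $error.
    while ! error=$(idn 2>&1 >&3); do
        case $error in
        *'Punycode failed'*|*'Output would be too large'*) ;; # expected error: restart
        *) break;;                                            # anything else: give up
        esac
    done 3>&1
}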

idn_skip < domain_list

This is ugly and stupid, and will not work when reading the domain list from a non-seekable file (which could be fixed bash-style by running it as stdbuf -i1 idn, but that will only make it even more ridiculous).
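
A slower variation on the same idea, which does not depend on stdin being seekable, is to start one idn process per line, so that a rejected name costs only that line. This is only a sketch: idn_each is a made-up name, the "skipped:" report format is an assumption, and spawning a process per domain will be slow on a list of almost 1M lines.

```shell
# Hypothetical helper (not part of idn): convert one domain per idn run.
# The body is a subshell "( ... )" so its variables stay local.
idn_each() (
    while IFS= read -r domain || [ -n "$domain" ]; do
        [ -n "$domain" ] || continue                 # skip empty lines
        if ascii=$(printf '%s\n' "$domain" | idn 2>/dev/null); then
            printf '%s\n' "$ascii"                   # converted name
        else
            printf 'skipped: %s\n' "$domain" >&2     # name that idn refused
        fi
    done
)

# Usage: idn_each < domain_list > clean 2> rejected
```

The names that idn refused end up in the file given for stderr, so they can be inspected afterwards instead of being silently dropped.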

Instead of trying to overcome idn's limitations, my advice would be to use the Net::LibIDN Perl package (apt-get install libnet-libidn-perl on Debian) and write the whole thing in Perl.
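
If a shell-only pre-filter is still wanted, the limits quoted in the question (1 to 63 characters per label, 255 for the whole name) can be checked with awk before idn runs. A rough sketch, with caveats: filter_labels is a made-up name, and the limits really apply to the encoded ASCII form, so a non-ASCII label that looks short enough here can still be too long once converted. This only removes the obvious offenders. Whether length() counts characters or bytes also depends on the awk implementation and locale.

```shell
# Hypothetical pre-filter: keep only names whose labels are 1..63 long and
# whose total length is at most 255 (the figures quoted in the question).
# A trailing dot counts as an empty last label, so such names are dropped.
filter_labels() {
    awk -F. -v max_label=63 -v max_name=255 '
    {
        if (length($0) < 1 || length($0) > max_name) next
        for (i = 1; i <= NF; i++)
            if (length($i) < 1 || length($i) > max_label) next
        print
    }'
}

# Usage: filter_labels < list | idn > clean
```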