
I have a file, addresses.csv, containing addresses in different international formats:

Example Street 1
Teststraße 2
Teststr. 1-5
Baker Street 221b
221B Baker Street
19th Ave 3B
3B 2nd Ave
1-3 2nd Mount x Ave
105 Lock St # 219
Test Street, 1
BookAve, 54, Extra Text 123#

For example, in Germany we write "Teststraße 2", while in the USA it would be "2 Test Street".

Is there a way to separate/extract all street names and street numbers into files like these?

output-names.csv

Example Street
Teststraße
Teststr.
Baker Street
Baker Street
19th Ave
2nd Ave
2nd Mount x Ave
Lock St # 219
Test Street
BookAve

output-numbers.csv

1
2
1-5
221b
221B
3B
3B
1-3
105
1
54

output-extra_text.csv


Extra Text 123#

I am using macOS 13; the shell is zsh 5.8.1 or bash 3.2.


My thoughts so far: you could first classify the addresses like this:

# Pseudocode: the quoted conditions are descriptions, not real tests.
x='the address line'
if [ "$x" = "begins with a letter" ]; then
    if [ "$x" = "begins with a letter + number + SPACE" ]; then
        echo 'something like "1A Street"'
        # NUMBER = '1A' / NAME = 'Street'
    else
        echo 'It begins with the STREET-NAME'
    fi
elif [ "$x" = "begins with a number" ]; then
    echo 'maybe STREET-NAME like "19th Ave 19B" or STREET-NUMBER like "19B Street"'
    # NUMBER = '19B' / NAME = '19th Ave' or 'Street'
    if [ "$x" = "begins with a number + SPACE" ]; then
        echo 'It begins with the STREET-NUMBER like "1 Street"'
        # NUMBER = '1' / NAME = 'Street'
    elif [ "$x" = "is (number)(text)(space)(text)(number(maybe-text))" ]; then
        echo 'For example 19th Street 19B -> The last number+text is the number (19B)'
        # NUMBER = '19B' / NAME = '19th Street'
    elif [ "$x" = "is (number(maybe-text))(space)(number)(text)(space)(text)" ]; then
        echo 'For example 19B 19th Street -> The first number+text is the number (19B)'
        # NUMBER = '19B' / NAME = '19th Street'
    else
        echo 'INVALID'
    fi
else
    echo 'INVALID'
fi
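
A minimal bash sketch of the classification above, using `[[ ... =~ ... ]]` instead of the placeholder conditions. The function name `split_address`, the `NAME|NUMBER|EXTRA` output format and the three patterns are assumptions derived only from the sample lines; this is not a general address parser, and anything it cannot classify is reported as INVALID.

```bash
#!/usr/bin/env bash
# Sketch only: split one address line into NAME|NUMBER|EXTRA.
# split_address and its patterns are hypothetical, based on the sample lines.

split_address() {
    local line=$1
    # House number: digits, optional range ("1-5"), optional trailing letter ("221b").
    local num='[0-9]+(-[0-9]+)?[A-Za-z]?'
    # "name, number[, extra]"  e.g. "BookAve, 54, Extra Text 123#"
    local comma_re="^([^,]+),[[:space:]]*(${num})(,[[:space:]]*(.*))?\$"
    # "number name"            e.g. "221B Baker Street"
    local lead_re="^(${num})[[:space:]]+(.+)\$"
    # "name number"            e.g. "19th Ave 3B"
    local trail_re="^(.*[^[:space:]])[[:space:]]+(${num})\$"

    # Group indexes skip the optional range group nested inside $num.
    if [[ $line =~ $comma_re ]]; then
        printf '%s|%s|%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[5]}"
    elif [[ $line =~ $lead_re ]]; then
        printf '%s|%s|\n' "${BASH_REMATCH[3]}" "${BASH_REMATCH[1]}"
    elif [[ $line =~ $trail_re ]]; then
        printf '%s|%s|\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
    else
        printf 'INVALID\n'
        return 1
    fi
}
```

Called once per line, `split_address '3B 2nd Ave'` prints `2nd Ave|3B|` and `split_address '19th Ave 3B'` prints `19th Ave|3B|`. A leading number is tried before a trailing one, so a line such as "105 Lock St # 219" is read as number 105 with the rest as the name; that ordering is a design choice of this sketch, not a rule about real addresses.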
R 9000
  • What about "42nd street"? I mean, pretty much anything, including numbers, can be street names. – terdon Mar 02 '23 at 16:51
  • Exactly.. "42nd street 3" (DE) or "3 42nd street" (US) means -> number="3" and name="42nd street" – R 9000 Mar 02 '23 at 17:01
  • Which is why I don't think it is possible to automate this short of using an actual AI trained on real street names :/ – terdon Mar 02 '23 at 17:08
  • I think it is possible. For your example, see "my thoughts", which I just added. – R 9000 Mar 02 '23 at 17:22
  • What if the address is "Flat B, 72 street"? Or "The Brown Cottage, Hanwell"? Or "Number 12, Foo street"? – terdon Mar 02 '23 at 18:20
  • 1) "Flat B, 72 street" -> "TEXT A, 00 TEXT": the only number is 72, so the question is whether "Flat B" or "street" is the street name. Good question. 2) "The Brown Cottage, Hanwell" -> no number -> invalid. 3) "Number 12, Foo street" -> "TEXT 00, TEXT TEXT": the number is 12, but as in (1), which part is the street and which is "extra text"? Good question. Maybe someone knows a solution. – R 9000 Mar 02 '23 at 18:43
  • So the problem is: is it "Street 72 EXTRA-TEXT" or "EXTRA-TEXT 72 Street"? It is OK for me if a line that begins with "EXTRA-TEXT" is treated as invalid. – R 9000 Mar 02 '23 at 18:55
  • That pseudo-code is shell-like. You would not do something like this in shell as it'd be hard to get the syntax right and take forever to run. See "Why is using a shell loop to process text considered bad practice?". You should use awk or some other general-purpose text-processing tool instead. – Ed Morton Mar 02 '23 at 19:33
  • How would you handle some rural addresses like 4593 NC 39? – doneal24 Mar 03 '23 at 00:07
  • @doneal24 address is invalid :p – R 9000 Mar 03 '23 at 01:45
  • How about something like "1e Korte Dwarsstraat 525 BG", pretty common with your neighbors in NL or Belgium? A tool that doesn't accept valid addresses, as you've shown above, is hardly an "ultimate tool". I agree with terdon, not doable w/o a reasonably well-working AI. – Peregrino69 Mar 03 '23 at 06:37
  • @R9000 My mother would be very distressed to hear that. :) – doneal24 Mar 03 '23 at 19:59