
Q: How do I split a complicated string when whitespace delimiters aren't discriminating enough?

Background

I'm working with BibTeX files. I want to split an author string of the form "first-name middle-name last-name" into its tokens.

Doing so is straightforward when each name is a single word because we can split by whitespace:

(split-string "Adam Smith")             ; => ("Adam" "Smith")

Problem: double-barrelled names

The problem is that some names (usually last names, but not always) are multiple words separated by whitespace. BibTeX handles that by enclosing such name components in curly braces, as with "Martin {Van Buren}". Here, a simple whitespace split gives the wrong answer:

(split-string "Martin {Van Buren}")     ; => ("Martin" "{Van" "Buren}")

Question: split complicated names?

So: how do I split complicated name strings where a) the string can contain an arbitrary number of name components, and b) each curly-braced component can contain an arbitrary number of whitespace-separated words?

Example desired output

Given a hypothetical function fancy-split, I'm looking to extract a list of name components:

(fancy-split "A Simple Name")           ; => ("A" "Simple" "Name")
(fancy-split "Walter J. {Fancy Pants}") ; => ("Walter" "J." "Fancy Pants")
(fancy-split "Ed {Big D} {del Mar}")    ; => ("Ed" "Big D" "del Mar")
(fancy-split "John {El Guapo} Smith")   ; => ("John" "El Guapo" "Smith")
Drew
Dan

3 Answers


Behold:

(defun fancy-split (input)
  (let (tokens)
    (with-temp-buffer
      (insert input)
      (goto-char (point-min))
      (while (not (eobp))
        (cond
         ;; Braced component: collect everything up to the closing brace.
         ((looking-at "{")
          (forward-char)
          (let ((start (point)))
            (if (search-forward "}" nil t)
                (push (buffer-substring start (1- (point))) tokens)
              (error "Unterminated brace"))))
         ;; Whitespace between components: skip it.
         ((looking-at "[[:space:]]+")
          (goto-char (match-end 0)))
         ;; Plain word: everything up to the next space or open brace.
         ((looking-at "[^[:space:]{]+")
          (push (buffer-substring (point) (match-end 0)) tokens)
          (goto-char (match-end 0))))))
    (nreverse tokens)))

It even detects braces you forgot to close (but doesn't handle nesting).
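For reference, calling it on the question's own examples produces the desired components, with the braces stripped:

(fancy-split "Walter J. {Fancy Pants}") ; => ("Walter" "J." "Fancy Pants")
(fancy-split "Ed {Big D} {del Mar}")    ; => ("Ed" "Big D" "del Mar")
(fancy-split "John {El Guapo} Smith")   ; => ("John" "El Guapo" "Smith")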

wasamasa
  • Thank you! I've never written a tokenizer before, so this was a neat learning experience. I also figured out an alternative way to do it using regular expressions, which I posted. It's less functional, but I'm only working with well-specified name strings. Surprisingly (to me), it's also a little faster. – Dan Oct 28 '17 at 20:27
  • Tokenization is typically a prerequisite to parsing. Had I recognized braces as one-character tokens, I could have written a function that keeps track of whether it encounters balanced parentheses and, if so, adds a compound name to its parse tree. That would resolve this one issue, but at questionable cost. – wasamasa Oct 28 '17 at 20:51

Here's a recursive function that returns the name as a list of its components.

(defun name-to-list (name &optional acc)
  "Takes a BibTeX-style name and returns a list of the name
components."
  (if (string-equal name "")
      (nreverse (mapcar (lambda (part)
                          (string-trim part "[ {]+" "[ }]+"))
                        acc))
    (let ((re-beg "\\`\\s-*")
          (re-txt "\\({.*?}\\|[^ \t]+\\)")
          (re-end "\\s-*\\(.*?\\)\\s-*\\'"))
      (string-match (concat re-beg re-txt re-end) name)
      (name-to-list (match-string 2 name)
                    (cons (match-string 1 name) acc)))))

(mapcar #'name-to-list
        '("A Simple Name"
          "Walter J. {Fancy Pants}"
          "Ed {Big D} {del Mar}"
          "John {El Guapo} Smith"))
; => (("A" "Simple" "Name")
;     ("Walter" "J." "Fancy Pants")
;     ("Ed" "Big D" "del Mar")
;     ("John" "El Guapo" "Smith"))

Note the ugly call to string-trim when it returns the values. I feel like that's cheating (or at least ugly), but I'm not skilled enough with regular expressions to figure out how to exclude the curly braces. If anyone can point out how to do so, I'd be grateful.
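One way to avoid the trimming step (a sketch along the same lines, not thoroughly tested; name-to-list-2 is just a hypothetical name) is to put the braces outside the capture groups, so they are matched but never captured. A shy group \(?: ... \) keeps the alternation from introducing an extra group number:

(defun name-to-list-2 (name &optional acc)
  "Like `name-to-list', but capture brace contents directly."
  (if (string-equal name "")
      (nreverse acc)
    (let ((re-beg "\\`\\s-*")
          ;; group 1: the text inside {...}; group 2: a plain word
          (re-txt "\\(?:{\\(.*?\\)}\\|\\([^ \t{]+\\)\\)")
          ;; group 3: the unconsumed remainder
          (re-end "\\s-*\\(.*?\\)\\s-*\\'"))
      (string-match (concat re-beg re-txt re-end) name)
      (name-to-list-2 (match-string 3 name)
                      (cons (or (match-string 1 name)
                                (match-string 2 name))
                            acc)))))

Only one of groups 1 and 2 matches on each step, so the or picks whichever is non-nil, and no post-hoc trimming is needed.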

Dan

Here's something you could do using a PEG parser. PEGs are a formalism that allows generating parsers without the tedious process associated with YACC-style tools. PEG parsers cannot capture some context-free languages, but because they are easy to augment with user code, they can often capture even some complex context-sensitive languages.

Unfortunately, the documentation of the PEG parser in Emacs Lisp is lacking; I managed to put together this example mostly thanks to some familiarity with other implementations of the concept, so YMMV.

So, here we go:

(require 'peg)

(defun fancy-split (input)
  (peg-parse-string
   ((s (list (* (* " ") elt-or-group)))
    (elt-or-group (or group elt))
    (group (substring "{" (* (* " ") (+ (not " ") (not "}") (any))) "}"))
    (elt (substring (+ (not " ") (any)))))
   input))

;;; Examples:

(fancy-split "foo bar baz")       ; => (("foo" "bar" "baz"))
(fancy-split "bar {foo bar} foo") ; => (("bar" "{foo bar}" "foo"))
wvxvw