Regular expression for a string

Question

Hi, I have an md file containing the string below, and I want to write a regular expression to validate it.

Conditions

The id can be anything.
The type will be youtube, vimeo, etc.
id and type are mandatory fields.

{% include video.html id="T3q6QcCQZQg" type="youtube" %}

So I want to check in a bash script that the string is in the proper format; otherwise the script should throw an error.

My current code looks like this. It works for me without the id check, but I need to add a regex for the id as well.

IFS=$'\n' read -r -d '' -a VIDEOS < <( grep  "video.html"  "$ROOT_DIR$file" && printf '\0' )
#output => {% include video.html id="T3q6QcCQZQg" type="youtube" %}
for str in "${VIDEOS[@]}"
do
    if [[ "$str" =~ ({%)[:space:][:space:][:space:][:space:]$ ]]; then
        flag="dummy"
        echo "Invalid format::  $second"
    fi
done

Please help

What does the rest of the file look like? Do any of the strings in that line appear anywhere else in the file? What have you tried so far? — Nasir Riley, Jul 05 '21 at 12:48
@NasirRiley hey I updated the question with code. please check — Meera Sebastian, Jul 05 '21 at 13:02
You have code stating "Remove possible leading /" etc. Can you show a verbatim example (possibly anonymized) of the initial grep output? — AdminBee, Jul 05 '21 at 13:14
@AdminBee I am searching through the md file I have. The first grep output is `{% include video.html id="T3q6QcCQZQg" type="youtube" %}
{% include video.html id="330853122" type="vimeo" %} ` — Meera Sebastian, Jul 05 '21 at 13:23
Would id="T3q6QcCQZQg (no closing quote) be invalid? This file format isn't going to be easy to validate. Since it looks a little bit like XML, a basic XML parser might be the way to go. — Jeremy Boden, Jul 05 '21 at 13:51
Follow-up question: Is the task really only to verify the specification and print an error if an invalid specification was found, or do you want to execute some code in case of an invalid specification. If the former, a dedicated tool like awk is certainly better suited for the task. If the latter, a shell script may indeed be your best approach. — AdminBee, Jul 06 '21 at 15:24

score 0 · Answer 1 · edited Jul 05 '21 at 14:48

In principle you are almost there. The following is a minimal testable version of a regular expression based on the example content you provided:

#!/bin/bash
VIDEOS=( '{% include video.html id="T3q6QcCQZQg" type="youtube" %}' '{% include video.html id="330853122" type="vimeo" %}' '{% include video.html id="330853122" type="nosuchplatform" %}')
regex='^{% include video\.html id="[^"]+" type="(youtube|vimeo)" %}$'
for v in "${VIDEOS[@]}"
do
    if [[ "$v" =~ $regex ]]
    then
        echo "$v : valid"
    else
        echo "$v : invalid"
    fi
done

The varying id field can be matched using the "[^"]+" construct, i.e. an opening ", followed by one or more characters that are not a ", and then a closing ". You can make it more specific if you know which characters are allowed in the id field; e.g. if it can only contain alphanumeric characters, try "[[:alnum:]]+" instead.
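
As a rough sketch of the stricter variant, the check could be wrapped in a small function. The function name is_valid_alnum_id is hypothetical, and the sketch assumes the ids really are purely alphanumeric (ids containing other characters, such as - or _, would be rejected):

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit status 0) if the line is a well-formed
# video include whose id consists only of alphanumeric characters.
is_valid_alnum_id() {
    local re='^\{% include video\.html id="[[:alnum:]]+" type="(youtube|vimeo)" %\}$'
    [[ $1 =~ $re ]]
}

for line in \
    '{% include video.html id="T3q6QcCQZQg" type="youtube" %}' \
    '{% include video.html id="bad id!" type="youtube" %}'
do
    if is_valid_alnum_id "$line"; then
        echo "$line : valid"
    else
        echo "$line : invalid"
    fi
done
```

The first sample line is reported as valid; the second is rejected because its id contains a space and an exclamation mark.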

Storing the regular expression in a shell variable avoids several quoting problems when formulating it. Just be sure not to quote the variable when using it in the test; a quoted right-hand side of =~ is matched as a literal string rather than as a regular expression.
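
The effect of quoting can be seen in a minimal sketch like the following (the variable names sample, unquoted and quoted are only illustrative, and a reasonably recent bash, 3.2 or later with default compatibility settings, is assumed):

```bash
#!/usr/bin/env bash
sample='{% include video.html id="330853122" type="vimeo" %}'
regex='id="[^"]+"'

# Unquoted: $regex is interpreted as a regular expression.
if [[ $sample =~ $regex ]]; then unquoted="match"; else unquoted="no match"; fi

# Quoted: "$regex" is treated as a literal string, so [^"]+ loses its
# special meaning and would have to appear verbatim in the sample.
if [[ $sample =~ "$regex" ]]; then quoted="match"; else quoted="no match"; fi

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

Here the unquoted test matches, while the quoted one does not.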

I also assumed that you want to output "valid" when the regular expression matches (your current code treats success of the =~ test as an "invalid" pattern).

@MeeraSebastian You are welcome. If you found this (or any other) answer useful, please consider accepting it so that others facing a similar problem may find it more easily. — AdminBee, Jul 06 '21 at 10:08
@AdminBee It seems I don't have enough reputation to accept an answer — Meera Sebastian, Jul 07 '21 at 17:32

glenn jackman · Answer 2 · 2021-07-05T16:44:05.540

0

Since the id and type tags are (probably) not required to be in that order, I'd use a series of regex tests:

for str in "${VIDEOS[@]}"; do
    if [[ $str =~ \{%[[:blank:]]+include[[:blank:]]+.*[[:blank:]]+%\} ]] &&
       [[ $str =~ \<id=\"[^\"]+\" ]] &&
       [[ $str =~ \<type=\"(youtube|vimeo)\" ]]
    then
        echo "valid"
    else
        echo "invalid"
    fi
done

edited Jul 05 '21 at 16:44

answered Jul 05 '21 at 16:35


score 0 · Answer 3 · answered Jul 06 '21 at 03:02

Bash is great at co-ordinating the execution of other programs, but it's a terrible language for text processing. You should use awk or perl for this. See "Why is using a shell loop to process text considered bad practice?".

For example, with a perl "one-liner":

$ perl -lne 'next unless m/{%.*video\.html.*%}/;
             ($id) = m/\bid\s*=\s*"([^"]+)"/i;
             ($type) = m/\btype\s*=\s*"(youtube|vimeo)"/i;
             print "Invalid format on line $. of $ARGV: $_" unless ($id && $type);' *.md

This allows for id and type to be in any order anywhere on the line, and also allows for optional extra whitespace (\s*) around the = symbols. It expects the entire video include to be on a single line (a more robust version could allow for multi-line strings, but this script doesn't do that). It can process multiple input files at once (e.g. *.md) and will tell you the line number and filename of any invalid lines it finds.

If you want to allow any value for $type (not just youtube or vimeo), replace the third line with:

($type) = m/\btype\s*=\s*"([^"]+)"/i;

or just add more allowed types in the alternation.

Same script as a standalone executable:

#!/usr/bin/perl
use strict;
while (<>) {
  chomp;
  # Only examine lines that contain a video include.
  next unless m/{%.*video\.html.*%}/;
  my ($id) = m/\bid\s*=\s*"([^"]+)"/i;
  # To accept any type, use this line instead of the next one:
  # my ($type) = m/\btype\s*=\s*"([^"]+)"/i;
  my ($type) = m/\btype\s*=\s*"(youtube|vimeo)"/i;
  print "Invalid format on line $. of $ARGV: $_\n" unless ($id && $type);
}

Save it as, e.g., verify-videos.pl somewhere in your PATH (e.g. ~/bin/ or /usr/local/bin/) and make it executable with chmod +x /path/to/verify-videos.pl.

That's entirely your choice, but you need to know that you will be using the wrong tool for the job. Bash is not suited to text processing, using it for that task will make the programming required more difficult and error-prone, as well as much slower. — cas, Jul 06 '21 at 10:10
