
I want to find the date range covered by each of the 4 CSV files in a directory. The date column is the last column in each CSV and is of the form 1/25/2012 7:20:55 PM.

The date field is always the last column, i.e. the 24th. Only the records in each CSV change, and the dates are ordered.

Is there a way to do this for each of the files and get the final range, assuming that the dates are ordered?

So if my first record is 1/25/2012 7:20:55 PM and my last record is 11/7/2016 2:36:20 PM,

I want the output date range 1/25/2012 - 11/7/2016, combining the date ranges of all 4 files.

Example input (some columns omitted for brevity):

  • First file:

    col1,col2,col3,...,col23,col24_time
    value1,value2,value3,...,value23,1/25/2012 7:20 PM
    value1,value2,value3,...,value23,1/26/2012 10:57 AM
    value1,value2,value3,...,value23,1/26/2012 2:20 PM
    value1,value2,value3,...,value23,1/30/2012 11:55 AM
    value1,value2,value3,...,value23,1/30/2012 3:17 PM
    value1,value2,value3,...,value23,1/30/2012 5:36 PM
    value1,value2,value3,...,value23,1/30/2012 8:16 PM
    ...
    value1,value2,value3,...,value23,4/11/2012 11:45 AM
    value1,value2,value3,...,value23,4/11/2012 2:23 PM
  • Intermediate files

  • Last file:

    value1,value2,value3,...,value23,3/11/2015 4:45 AM
    value1,value2,value3,...,value23,3/11/2015 8:40 AM
    ...
    value1,value2,value3,...,value23,11/7/2016 2:36 PM

I have close to 5-10K records in each file. The dates are ordered within the files, and each file has a header row for each column.

The output from the head -n7 Files/file1.csv | cut -d, -f24 command is:

"col24_time"
"2012-01-01 00:30:26"
"0"
"2012-01-01 02:00:37"
"0"
"0"
"https://external.xx.fbcdn.net/safe_image.php?" <<-- previous column record?
    Please [edit] your question and include a sample of your csv files. We can't help you parse data you don't show us. You can use the formatting tools to format your files as code. – terdon Sep 16 '20 at 12:15
  • I think I guess what you want to do, if so you can use some sort of filter on reading the CSV file with awk or sed from bash. if you give us an example/sample of your needs we can make you a model. – francois P Sep 16 '20 at 12:52
  • You need the date from the first line and the last line, and then calculate the duration from that, is that right? – glenn jackman Sep 16 '20 at 13:22
  • @glennjackman Yes something like that. – Emma Austin Sep 17 '20 at 00:53
  • @francoisP Edited – Emma Austin Sep 17 '20 at 00:56
  • (1) You say that the date/time is always the last field on the line. These are dumb questions, made necessary by your refusal to volunteer any information: (1a) Does every line have the same number of fields? Or could the date/time be the sixth field of the first line and the eighth field of the last line, if it’s the last field on the line in each case? (1b) If every line has the same number of fields, do you know what that number is? (and will you tell us?) … (Cont’d) – G-Man Says 'Reinstate Monica' Sep 17 '20 at 05:22
  • (Cont’d) …  (2) You’ve said that you want the range, but, when asked whether you want the duration, you said “something like that.” I can easily give you an answer that gives output that is something like what you want; would that be good enough for you? If not, please be more clear and specific about what you want. (3) You say “I want to club all the 4 files date range.” What do you mean by that? … … … … … … … … … … … … … … Please do not respond in comments; [edit] your question to make it clearer and more complete. – G-Man Says 'Reinstate Monica' Sep 17 '20 at 05:22
  • @EmmaAustin So basically you want to output the 24th column of the first line of the first line, and then the 24th column of the last line of the last (=4th) file, but restricted to dates (and omitting time-of-day) ... ? – AdminBee Sep 17 '20 at 07:48
  • @AdminBee Yes. But I have 4 csv files in the directory. I want to do it for each. And find the final range considering all the 4 files. – Emma Austin Sep 17 '20 at 07:50
  • I'm afraid your explanations are still somewhat unclear. In the "sample output" you provided, is 1/25/2012 the timestamp associated with the first record of the first file, and 11/7/2016 the timestamp associated with the last record of the last file? – AdminBee Sep 17 '20 at 07:56
  • @AdminBee edited. Yes you are right. that is what i meant. – Emma Austin Sep 17 '20 at 08:02
  • @EmmaAustin I tried to make the intention clearer by providing "example input" split over several files. Still, the timestamp format you have provided in the example differs from your original specification in that it doesn't contain seconds anymore and switched from 12h-AM/PM to 24h-format. Which one is correct? – AdminBee Sep 17 '20 at 08:16
  • @AdminBee Yeah 12h - AM/PM format – Emma Austin Sep 17 '20 at 09:07
  • ... and with seconds? – AdminBee Sep 17 '20 at 09:26
  • @EmmaAustin Please make sure that edits are consistent. Currently it looks as if only the first file contained a header line. What about the other files? If they don't I would recommend stating that explicitly as it is confusing otherwise. – AdminBee Sep 21 '20 at 10:07

4 Answers


I’m still not sure that I understand the question. But here is some code that produces the desired output from the specified input, and it’s much shorter than the other answers:

datetime1=$(head -n1 file1.csv | cut -d, -f24)
datetime4=$(tail -n1 file4.csv | cut -d, -f24)
printf '%s - %s\n' "${datetime1%% *}" "${datetime4%% *}"

This takes the first line from the first file and the last line from the fourth and final file, and extracts the 24th field (using , as the delimiter) of each.  These are date-time strings; specifically, 1/25/2012 7:20 PM and 11/7/2016 2:36 PM.  It then prints the first “word” of each by stripping off the first space and everything that comes after it.  These are the dates, as desired.

Here’s a quasi-one-liner equivalent.  I spread it across three lines for readability, but it’s logically one long command.

printf '%s - %s\n' \
        "$(head -n1 file1.csv | cut -d, -f24 | cut -d' ' -f1)" \
        "$(tail -n1 file4.csv | cut -d, -f24 | cut -d' ' -f1)"

Here, since we aren’t using any variables, we can’t use parameter expansion, so I extracted the first “word” of each 24th field with a second cut.
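Since the comments below reveal that the files actually start with a header row, here is a hedged variant of the same idea that takes the second line of the first file instead of the first; the file names file1.csv / file4.csv are assumptions carried over from the answer:

```shell
# Variant of the above for files whose first line is a column header.
# file1.csv / file4.csv are assumed names, as in the answer.
first=$(sed -n '2p' file1.csv | cut -d, -f24)    # first data line, 24th field
last=$(tail -n1 file4.csv | cut -d, -f24)        # last line, 24th field
printf '%s - %s\n' "${first%% *}" "${last%% *}"  # keep only the date part
```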

  • I still get "2012-01-01 00:30:26" . Also the head returns the header column name. – Emma Austin Sep 18 '20 at 16:07
  • You were asked to provide sample data.  You provided sample data that didn’t have header lines.  You were asked whether you wanted the dates from the first and last lines, and you said “Yes something like that.”  I asked you to clarify what you meant when you said “something like that” and, as far as I can tell, you ignored that request.  You were asked again whether you wanted the 24th column of *the first line of the first line* [sic], and you said “Yes” with no restriction or refinement. … … So I’m just going to ignore your comment about "2012-01-01 00:30:26" until you start helping me. – G-Man Says 'Reinstate Monica' Sep 18 '20 at 20:52
  • Sorry about that. Added records. – Emma Austin Sep 19 '20 at 02:25
1

If your data is, as you say, already date-sequenced and consistently structured, then you can use sed to work on your specific lines:

sed -E -n "2 {s/.*,([^ ]+).*/\1 - /;h}; $ {s/.*,([^ ]+).*/\1/;H;x;s/\n//;p}" file

Output for your first file is

1/25/2012 - 4/11/2012

To do all the files together, just cat them (assuming they are named in date order so the shell expands them in the correct sequence):

cat file* | sed ...
1/25/2012 - 11/7/2016

Walkthrough

Set the sed default to -not print

sed -E -n "

Grab the 2nd line, collect the part of the line you want ([^ ]+) as a capture group, and compose your output in pattern space as your capture plus the separator: \1 -

           2 {s/.*,([^ ]+).*/\1 - /;

Push that into hold space (h clears out anything that was there before)

                                h}; 

On the la$t line grab the part of the line you want again in pattern space

                                    $ {s/.*,([^ ]+).*/\1/;

Append the new pattern space to the Hold space with a \newline between the old and new content (H adds the \newline) and then exchange the entire contents of hold space with pattern space

                                                H;x;

With your composed output now in pattern space, just remove the unwanted \newline and print

                                                    s/\n//;p}" file
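If every file carries its own header row, a sketch of the same sed approach (assuming GNU tail for the -q option) strips the headers first and then addresses line 1 instead of line 2:

```shell
# -q suppresses tail's per-file "==> file <==" banners; -n +2 drops each
# file's first (header) line before the concatenated stream reaches sed.
tail -q -n +2 file*.csv \
| sed -E -n "1 {s/.*,([^ ]+).*/\1 - /;h}; $ {s/.*,([^ ]+).*/\1/;H;x;s/\n//;p}"
```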
bu5hman

The following awk program will work (let's call it timerange.awk). It is designed so that you don't have to provide the files in any specific order but can simply use it on *.csv - otherwise, you would only need to provide the first and last file in the directory anyway, as you state that timestamps are ordered.

#!/usr/bin/awk -f

For every line of the files (after the first, which contains headers)

FNR>1{
    # Break the timestamp field into its individual components and reassemble
    # them in a way that 'mktime' understands, to generate an epoch-based
    # timestamp for "later/earlier than"-type comparisons.
    split($NF,a,/[ /:]/);
    if (a[6]=="AM" && a[4]==12) a[4]=0;
    if (a[6]=="PM" && a[4]<12) a[4]=a[4]+12;
    tst=a[3] " " a[1] " " a[2] " " a[4] " " a[5] " 00";
    curr_ts=mktime(tst);

# If we are on the first "data" row of the first file, initialize start and end
# date
if (NR==2)
{
    end=start=$NF;
    end_ts=start_ts=curr_ts;
}
# On all later lines, check if the timestamps associated with the "start"
# and "end" time specifications are later resp. earlier than that of the
# current line. If so, update the "start" and "end" specifications.
else
{
    if (curr_ts>end_ts) {end_ts=curr_ts; end=$NF};
    if (curr_ts<start_ts) {start_ts=curr_ts; start=$NF};
}

}

After the last file was processed: Output the human-readable range

END{print start " - " end}

You can call it as

awk -F, -f timerange.awk file1.csv file2.csv ...

or simply

awk -F, -f timerange.awk *.csv

as the ordering of the files is not relevant; it will automatically look for the global "first" and "last" entry.

Output for the example input (in its current form; I have assumed that, contrary to your first statement, the timestamps do not include seconds):

1/25/2012 7:20 PM - 11/7/2016 2:36 PM
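The 12-hour to 24-hour normalization at the heart of the conversion can be sanity-checked in isolation; this small sketch (plain awk, no mktime needed) also shows the 12 AM / 12 PM edge cases:

```shell
printf '%s\n' "12:05 AM" "7:20 PM" "12:30 PM" \
| awk '{split($0,a,/[ :]/);
        if (a[3]=="AM" && a[1]==12) a[1]=0;   # 12 AM is hour 0
        if (a[3]=="PM" && a[1]<12) a[1]+=12;  # 12 PM stays 12
        printf "%02d:%s\n", a[1], a[2]}'
# prints 00:05, 19:20 and 12:30
```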

Update

If you want to completely ignore the time of day, the program can be abridged:

#!/usr/bin/awk -f

For every line of the files (after the first, which contains headers)

FNR>1{
    split($NF,a,/[ /:]/);
    tst=a[3] " " a[1] " " a[2] " 00 00 00";
    curr_ts=mktime(tst);

    sub(/[[:space:]]+.* [AP]M$/,"",$NF);

if (NR==2)
{
    end=start=$NF;
    end_ts=start_ts=curr_ts;
}
else
{
    if (curr_ts>end_ts) {end_ts=curr_ts; end=$NF};
    if (curr_ts<start_ts) {start_ts=curr_ts; start=$NF};
}

}

END{print start " - " end}

AdminBee

If you like Unix pipes, you can do it this way:

# standalone example: this converts a 2-column, 1-line "csv" to unixtime,
#  and converts it back to a readable date
echo "2,1/25/2012  7:20:55 PM" \
| perl -aF, -MDate::Parse -E "say Date::Parse::str2time(\$F[1])" - \
| xargs -i date "+%D " -d@{}

result

01/25/12

This relies on the old but non-core Perl module Date::Parse, which you'd need to install first if you don't already have it, e.g. with cpan Date::Parse or with the zero-configuration client cpanm Date::Parse.

So for your example you could try this to get the oldest and the most recent date, each on its own line

perl -aF, -MDate::Parse -E "say Date::Parse::str2time(\$F[23])" *.csv \
| sort -n \
| sed -e 1b -e '$!d'  \
| xargs -i date "+%D " -d@{}

result

01/25/12 11/07/16

The sed line is from this post on this site.
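That idiom (b branches past the delete on line 1; $!d deletes every line that is not the last) keeps only the first and last lines of its input, as a quick standalone check shows:

```shell
printf '%s\n' one two three four five | sed -e 1b -e '$!d'
# prints:
# one
# five
```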

knb