Assignment 3

For these problems, use your plain text editor (e.g. Notepad++). Paste the text into a new document and use the search function to write a regular expression that gives the desired result. Submit your assignment as the html output of an Rmarkdown document linked in your class portfolio. In that document, provide the regular expression that works (there may be several possible “right” answers) as plain text on the markdown page, followed by an explanation of what each element of the expression is doing. If you get stuck, give the solution that gets you as close as you can.

1. “Our preferred format for data is a csv file. Use regular expressions to convert this table I copied from a pdf into a csv format”

Original:

Candidate Choice    Absentee Mail   Early Voting    Election Day    Total Votes
TODD RUSS   7,021   8,194   135,216   150,431
CLARK JOLLEY    7,012   5,835   107,714   120,561

Steps:

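Find \, and replace with nothing: This finds every comma (here, the thousands separators inside the numbers) and removes it. The backslash is optional, since a comma is not a special character.
Find \s{2,} and replace with comma space ", ": This finds each run of two or more consecutive whitespace characters (the gaps between columns) and replaces it with one comma followed by a space, which gives the comma-separated fields a csv needs.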
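
As a cross-check outside the text editor, the two steps above can be sketched with Python's re module. This is only an illustration under my own assumptions: the variable names (table, no_commas, csv_text) are made up for the example, and the table is typed in with runs of spaces between columns, as it appears above.

```python
import re

# The table as pasted from the pdf; columns are separated by runs of spaces.
table = """Candidate Choice    Absentee Mail   Early Voting    Election Day    Total Votes
TODD RUSS   7,021   8,194   135,216   150,431
CLARK JOLLEY    7,012   5,835   107,714   120,561"""

# Step 1: delete every comma (they only occur as thousands separators here).
no_commas = re.sub(r",", "", table)

# Step 2: replace each run of two or more whitespace characters with ", ".
# The single spaces inside "Candidate Choice" or "TODD RUSS" are left alone.
csv_text = re.sub(r"\s{2,}", ", ", no_commas)

print(csv_text)
```

One caveat with \s is that it also matches line breaks, so a line with trailing spaces could be merged with the next one; writing the pattern as two or more literal spaces would avoid that.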

Result:

Candidate Choice, Absentee Mail, Early Voting, Election Day, Total Votes
TODD RUSS, 7021, 8194, 135216, 150431
CLARK JOLLEY, 7012, 5835, 107714, 120561

2. Reformat our class roster.

Original:

Adamic, Emily M.    ema3896@utulsa.edu
Bierbaum, Emily L.  elb0588@utulsa.edu
Cartmell, Laci J.   ljc454@utulsa.edu
Delaporte, Elise    eld0070@utulsa.edu
Hansen, Rebekah E.  reh9623@utulsa.edu
Herrboldt, Madison A.   mah1626@utulsa.edu
Lewis, Cari D.  cdl5261@utulsa.edu
Mierow, Tanner T.   ttm5619@utulsa.edu
Naranjo, Daniel S.  dsn8679@utulsa.edu
Paslay, Caleb   cap1050@utulsa.edu
Pletcher, Olivia M. omp9336@utulsa.edu
West, Amy C.    acw1471@utulsa.edu

Steps:

Find \s\w\.\s and replace with three spaces: This finds the middle initial (“space / single letter / period / space”) and removes it, leaving three spaces so the name and the email stay clearly separated. The period is escaped (\.) so that it matches a literal period rather than any character.
Find , and replace with a space: This replaces each comma with a space; now every line has the form “word word word”, which is easier to capture and reorder.
Find @utulsa\.edu and replace with nothing: This removes the @utulsa.edu domain, so the third part of each line contains only word characters and can be matched by \w+ like the other two.
Find (\w+)\s+(\w+)\s+(\w+) and replace with \2 \1 \(\3@utulsa.edu\): This “captures” each word (\w+, one or more word characters) separated by \s+ (one or more spaces) and “stores” them, so they can be written back in the desired order separated by one space, with @utulsa.edu added back inside the parentheses.

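Note: Step 4 does not work in Atom, the text editor I have been using on my Mac. Atom does not treat “\1” in the replacement as a captured group; it just prints “\1” literally. This may be because Atom follows the JavaScript convention of writing captured groups as $1, $2, $3 in the replacement.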
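
The four steps can also be sketched in Python, where \1-style references do work in the replacement string. This is a hedged illustration rather than the editor workflow itself: it uses only three of the roster lines, the variable names are my own, and the period in step 1 and in the domain is escaped so that it matches a literal period.

```python
import re

# Three roster lines, including one without a middle initial.
roster = """Adamic, Emily M.    ema3896@utulsa.edu
Delaporte, Elise    eld0070@utulsa.edu
West, Amy C.    acw1471@utulsa.edu"""

# Step 1: replace the middle initial (space, letter, period, space) with three spaces.
text = re.sub(r"\s\w\.\s", "   ", roster)

# Step 2: replace every comma with a space.
text = re.sub(r",", " ", text)

# Step 3: remove the email domain so each line is just three words.
text = re.sub(r"@utulsa\.edu", "", text)

# Step 4: capture the three words and write them back as "First Last (email)".
reformatted = re.sub(r"(\w+)\s+(\w+)\s+(\w+)", r"\2 \1 (\3@utulsa.edu)", text)

print(reformatted)
```

Unlike in Notepad++, the parentheses in the Python replacement string do not need a backslash.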

Result:

Emily Adamic (ema3896@utulsa.edu)
Emily Bierbaum (elb0588@utulsa.edu)
Laci Cartmell (ljc454@utulsa.edu)
Elise Delaporte (eld0070@utulsa.edu)
Rebekah Hansen (reh9623@utulsa.edu)
Madison Herrboldt (mah1626@utulsa.edu)
Cari Lewis (cdl5261@utulsa.edu)
Tanner Mierow (ttm5619@utulsa.edu)
Daniel Naranjo (dsn8679@utulsa.edu)
Caleb Paslay (cap1050@utulsa.edu)
Olivia Pletcher (omp9336@utulsa.edu)
Amy West (acw1471@utulsa.edu)

3. Use regular expressions to drop the genus name

Original:

Banded sculpin, Cottus carolinae, 5
Redspot chub, Nocomis asper, 5
Northern hog sucker, Hypentelium nigricans, 6
Creek chub, Semotilus atromaculatus, 8
Stippled darter, Etheostoma punctulatum, 9
Smallmouth bass, Micropterus dolomieu, 10
Logperch, Percina caprodes, 13
Slender madtom, Noturus exilis, 14

Steps:

Find ,\s\w+\s\w+, and replace with ,: This finds the comma before the scientific name, a space, one or more word characters (the genus), another space, one or more word characters (the species), and the comma after the scientific name. That isolates the whole scientific name, which is then replaced with just a comma.
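
A minimal Python sketch of the same find and replace, using three of the lines from the data set. The variable names (fish, dropped) are hypothetical and only serve the example.

```python
import re

# Three lines from the original data set.
fish = """Banded sculpin, Cottus carolinae, 5
Northern hog sucker, Hypentelium nigricans, 6
Logperch, Percina caprodes, 13"""

# Match ", Genus species," and replace it with a single comma.
# The space before the count is outside the match, so it is kept.
dropped = re.sub(r",\s\w+\s\w+,", ",", fish)

print(dropped)
```

The common name is never matched, even when it has several words, because the pattern has to start at a comma and end at a comma with exactly two words in between.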

Result:

Banded sculpin, 5
Redspot chub, 5
Northern hog sucker, 6
Creek chub, 8
Stippled darter, 9
Smallmouth bass, 10
Logperch, 13
Slender madtom, 14

4. With the original data set, use regular expression to modify the names

Steps:

Find ,\s(\d+) and replace with five spaces and then \1: This removes the comma before the number at the end of the line and puts five spaces in its place, a separator that appears nowhere else in the data. Now the only comma left on each line is the one before the genus, which is the word we want to modify.
Find ,\s(\w)\w+\s and replace with , \1_: This finds the comma and space, captures the first letter of the genus, and matches the rest of the genus (one or more word characters) plus the space after it. It replaces all of that with a comma, a space, the captured first letter, and an underscore, as desired.
Fix the five spaces: find \s{5} and replace with comma space ", ": This turns the five-space placeholder back into the desired comma and space.
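
The three steps above can be sketched in Python as follows. As before, this is only an illustration with made-up variable names and three sample lines from the data set.

```python
import re

fish = """Banded sculpin, Cottus carolinae, 5
Northern hog sucker, Hypentelium nigricans, 6
Logperch, Percina caprodes, 13"""

# Step 1: swap the ", " before the trailing number for a five-space placeholder.
text = re.sub(r",\s(\d+)", " " * 5 + r"\1", fish)

# Step 2: keep only the first letter of the genus and join it to the species with "_".
text = re.sub(r",\s(\w)\w+\s", r", \1_", text)

# Step 3: turn the five-space placeholder back into ", ".
modified = re.sub(r"\s{5}", ", ", text)

print(modified)
```

The placeholder matters because without step 1 the pattern in step 2 could not tell the comma before the genus from the comma before the count by position alone; here the digit-only capture in step 1 is what separates the two.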

Result:

Banded sculpin, C_carolinae, 5
Redspot chub, N_asper, 5
Northern hog sucker, H_nigricans, 6
Creek chub, S_atromaculatus, 8
Stippled darter, E_punctulatum, 9
Smallmouth bass, M_dolomieu, 10
Logperch, P_caprodes, 13
Slender madtom, N_exilis, 14

5. Starting with the original data set abbreviate the genus species

Steps:

Same as step 1 of the preceding question.
Find ,\s(\w)\w+\s(\w{3})\w+ and replace with , \1_\2.,: This captures the first letter of the genus and the first three letters of the species, and replaces the whole scientific name with the first captured element, an underscore, the second captured element, a period, and a comma, as desired.
Same find as step 3 of the preceding question (\s{5}), but replace with a single space: the comma is already supplied by step 2, so the five spaces only need to be collapsed.
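
Here is a hedged Python sketch of these three steps, again with hypothetical variable names and a few sample lines. Note that the replacement in step 2 ends with a comma, so step 3 replaces the placeholder with a single space only.

```python
import re

fish = """Banded sculpin, Cottus carolinae, 5
Redspot chub, Nocomis asper, 5
Logperch, Percina caprodes, 13"""

# Step 1: swap the ", " before the trailing number for a five-space placeholder.
text = re.sub(r",\s(\d+)", " " * 5 + r"\1", fish)

# Step 2: keep the first letter of the genus and the first three letters of the
# species, then add a period and the comma that will precede the count.
text = re.sub(r",\s(\w)\w+\s(\w{3})\w+", r", \1_\2.,", text)

# Step 3: collapse the five-space placeholder to a single space.
abbreviated = re.sub(r"\s{5}", " ", text)

print(abbreviated)
```

Because of the \w{3} followed by \w+, this pattern assumes every species name has at least four letters, which holds for this data set.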

Result:

Banded sculpin, C_car., 5
Redspot chub, N_asp., 5
Northern hog sucker, H_nig., 6
Creek chub, S_atr., 8
Stippled darter, E_pun., 9
Smallmouth bass, M_dol., 10
Logperch, P_cap., 13
Slender madtom, N_exi., 14

6. Create a new file that contains only the fasta headers (lines that begin with >) from your file

I downloaded the Homo sapiens hemoglobin subunit alpha 1 (HBA1) gene from the NIH website, unzipped it, and saved the fasta file as “Hb_gene.fna” in my working directory.

To create a file that contains only the headers:

grep "^>" Hb_gene.fna > Hb_gene_headers.txt

cat Hb_gene_headers.txt
>NC_000016.10:176680-177522 HBA1 [organism=Homo sapiens] [GeneID=3039] [chromosome=16]
>NC_060940.1:170722-171564 HBA1 [organism=Homo sapiens] [GeneID=3039] [chromosome=16]
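
For comparison, the same header filter could be sketched in Python. The function name fasta_headers and the in-memory sample are my own assumptions for the example; the two header lines are the ones shown above, but the short sequence lines are made-up placeholders, not real HBA1 sequence.

```python
def fasta_headers(lines):
    """Return the lines that begin with '>' (the fasta headers), without line breaks."""
    return [line.rstrip("\n") for line in lines if line.startswith(">")]

# Small stand-in for Hb_gene.fna; the sequence lines are placeholders only.
sample = [
    ">NC_000016.10:176680-177522 HBA1 [organism=Homo sapiens] [GeneID=3039] [chromosome=16]\n",
    "ACGTACGTAC\n",
    ">NC_060940.1:170722-171564 HBA1 [organism=Homo sapiens] [GeneID=3039] [chromosome=16]\n",
    "TTGACCGGTA\n",
]

headers = fasta_headers(sample)
print("\n".join(headers))

# With the real file, the same function could be used like this:
# with open("Hb_gene.fna") as fna, open("Hb_gene_headers.txt", "w") as out:
#     out.write("\n".join(fasta_headers(fna)) + "\n")
```

Checking that the line starts with ">" plays the same role as anchoring the grep pattern to the start of the line.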

Emily Adamic

9/15/2022