1. Convert a pdf into csv
pdf data:
Candidate Choice  Absentee Mail  Early Voting  Election Day  Total Votes
TODD RUSS  7,021  8,194  135,216  150,431
CLARK JOLLEY  7,012  5,835  107,714  120,561
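Regular expression to find the runs of two or more spaces between columns
and change them to a comma and a space:
# find two or more consecutive whitespace characters
\s+\s
# replace each run with a comma and one literal space (many tools reject \s in a replacement)
, 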
.csv file:
Candidate Choice, Absentee Mail, Early Voting, Election Day, Total Votes
TODD RUSS, 7,021, 8,194, 135,216, 150,431
CLARK JOLLEY, 7,012, 5,835, 107,714, 120,561
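One caveat about the file above: the vote counts contain thousands separators, so a CSV reader will split "7,021" into two fields. The following is a minimal Python sketch of one way around that, assuming the text copied from the pdf still separates columns with two or more spaces. The names pdf_text, to_rows, and to_csv are made up for this example. It splits on the same \s+\s pattern and lets the standard csv module quote any field that contains a comma.

```python
import csv
import io
import re

# Assumed input: columns separated by two or more spaces, as copied from the pdf.
pdf_text = """Candidate Choice  Absentee Mail  Early Voting  Election Day  Total Votes
TODD RUSS  7,021  8,194  135,216  150,431
CLARK JOLLEY  7,012  5,835  107,714  120,561"""


def to_rows(text):
    """Split each line into fields on runs of two or more whitespace characters."""
    rows = []
    for line in text.splitlines():
        # Working line by line keeps \s from ever matching a line break.
        rows.append(re.split(r"\s+\s", line.strip()))
    return rows


def to_csv(rows):
    """Serialize rows as CSV; fields containing commas are quoted automatically."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = to_rows(pdf_text)
csv_text = to_csv(rows)
print(csv_text)
```

Quoting keeps the numbers exactly as printed in the pdf. Stripping the thousands separators instead would be an equally reasonable choice if the values need to be read as integers later.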
3. Remove the genus and species names from a dataset
Dataset:
Banded sculpin, Cottus carolinae, 5
Redspot chub, Nocomis asper, 5
Northern hog sucker, Hypentelium nigricans, 6
Creek chub, Semotilus atromaculatus, 8
Stippled darter, Etheostoma punctulatum, 9
Smallmouth bass, Micropterus dolomieu, 10
Logperch, Percina caprodes, 13
Slender madtom, Noturus exilis, 14
Regular expression to remove genus and species:
# (common name)+comma space+(genus species)+comma space+(number)
(\w+\s*\w+\s*\w+)+,\s+(\w+\s*\w+)+,\s+(\w+)
# common name, number
\1, \3
New dataset:
Banded sculpin, 5
Redspot chub, 5
Northern hog sucker, 6
Creek chub, 8
Stippled darter, 9
Smallmouth bass, 10
Logperch, 13
Slender madtom, 14
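The same find and replace can be checked outside a text editor. This is a small sketch using Python's re module, which accepts the pattern and the \1, \3 replacement unchanged. The function name drop_scientific_names is made up for this example, and only a few rows of the dataset are included.

```python
import re

# Pattern and replacement copied verbatim from the step above.
PATTERN = r"(\w+\s*\w+\s*\w+)+,\s+(\w+\s*\w+)+,\s+(\w+)"
REPLACEMENT = r"\1, \3"

dataset = """Banded sculpin, Cottus carolinae, 5
Redspot chub, Nocomis asper, 5
Northern hog sucker, Hypentelium nigricans, 6
Logperch, Percina caprodes, 13"""


def drop_scientific_names(text):
    """Keep the common name and the count, dropping the genus and species."""
    # Apply the substitution line by line so \s never spans a line break.
    return "\n".join(
        re.sub(PATTERN, REPLACEMENT, line) for line in text.splitlines()
    )


cleaned = drop_scientific_names(dataset)
print(cleaned)
```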
5. Abbreviate the genus and species as G_spe.
Regular expression:
# (common name)+comma space+(G)enus+optional spaces+(spe)cies+comma space+(number)
(\w+\s*\w+\s*\w+)+,\s+(\w)\w+\s*(\w{3})\w+,\s+(\w+)
# common name, G_spe., number
\1, \2_\3., \4
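To see what the abbreviation produces, the pattern can be run with Python's re module. This is only a sketch: the function name abbreviate is made up for this example, and three rows from the dataset in exercise 3 are used as input.

```python
import re

# Pattern and replacement copied verbatim from the step above.
PATTERN = r"(\w+\s*\w+\s*\w+)+,\s+(\w)\w+\s*(\w{3})\w+,\s+(\w+)"
REPLACEMENT = r"\1, \2_\3., \4"


def abbreviate(line):
    """Rewrite 'common name, Genus species, n' as 'common name, G_spe., n'."""
    return re.sub(PATTERN, REPLACEMENT, line)


examples = [
    "Banded sculpin, Cottus carolinae, 5",
    "Smallmouth bass, Micropterus dolomieu, 10",
    "Logperch, Percina caprodes, 13",
]
abbreviated = [abbreviate(line) for line in examples]
for line in abbreviated:
    print(line)
```

Because the species part is written as (\w{3})\w+, a species name needs at least four letters to match. A line with a shorter species name is left unchanged.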
6. Using the Cimex lectularius genome (6.1) and mitogenome
(6.2), complete the following:
Commands:
# pull all the lines that contain ">" and write them to a new file "fasta_headers_Clec.txt"
grep '>' C_lec.fna > fasta_headers_Clec.txt
Head of the new file:
# print the top 10 lines of the new file to the standard output
head fasta_headers_Clec.txt
>NM_001316700.2 Cimex lectularius apyrase (LOC106669828), mRNA
>NM_001316702.1 Cimex lectularius NADPH--cytochrome P450 reductase (LOC106668336), mRNA
>NM_001316703.1 Cimex lectularius sodium channel protein para (LOC106667833), mRNA
>NM_001316704.1 Cimex lectularius 72 kDa inositol polyphosphate 5-phosphatase-like (LOC106662976), mRNA
>NM_001316705.1 Cimex lectularius acetylcholinesterase-like (LOC106669386), mRNA
>NM_001316706.1 Cimex lectularius acetylcholinesterase-like (LOC106664272), mRNA
>NM_001316707.1 Cimex lectularius acetylcholinesterase-like (LOC106669436), mRNA
>NM_001316708.1 Cimex lectularius odorant receptor coreceptor (LOC106665376), mRNA
>NM_001316709.1 Cimex lectularius probable cytochrome P450 6d5 (LOC106673892), mRNA
>XM_014383668.2 PREDICTED: Cimex lectularius uncharacterized LOC106674453 (LOC106674453), mRNA
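For comparison, the header extraction can also be sketched in Python. The function name fasta_headers is made up for this example, and the toy input pairs two headers from the output above with placeholder sequences, not real data. With the real file, the handle would come from opening C_lec.fna.

```python
import io


def fasta_headers(handle):
    """Yield every FASTA header line (a line starting with '>') from a file handle."""
    for line in handle:
        if line.startswith(">"):
            yield line.rstrip("\n")


# Toy input: real headers from the output above, placeholder sequences.
toy_fasta = io.StringIO(
    ">NM_001316700.2 Cimex lectularius apyrase (LOC106669828), mRNA\n"
    "ACGTACGTACGT\n"
    "TTGACCA\n"
    ">NM_001316703.1 Cimex lectularius sodium channel protein para (LOC106667833), mRNA\n"
    "GGCCTTAA\n"
)

headers = list(fasta_headers(toy_fasta))
for header in headers:
    print(header)
```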
6.2 Create a new file that contains the full sequences of only the
ribosomal transcripts or proteins:
Commands:
# insert a separator line (a single space) before each header | print each ribosomal record, from its header through the next separator line
sed -e 's/>/ \n>/g' C_lec.fna | sed -n '/ribosom/,/^ /p' > RNA_Clec.txt
Head of the new file:
# list the top 20 lines of the ribosomal transcripts file
head -20 RNA_Clec.txt
>XM_014383747.2 PREDICTED: Cimex lectularius 60S ribosomal protein L22-like (LOC106660802), mRNA
CGGTAAATTTGGTGAAAAGTTTTAGCTGTACTTTTGTTTTTAAACAGAATAAGTTGATTTCCTCTCAGGTAGTTTAGTGA
TTAATGGAAAAATGCAGAGCATTTATATTTTAAAAGCTACTTGAAATAAAAATAAAGTATAAATGATGATTAAGTCATCG
ACGAAGTAGGTCAACGTGTGTTCACCTTCCTTCTAACCTTTTTAGATCGCCCATTTTGTGATCCACGCGTTTCTATCAGT
TCGTGCGGACTTTTGTTTTTTTTTACCATGGCACCTGCAAAGAAACCTGGTGCCGCTGCAAAAAAGACACCGACGACAAC
AGTCACTTCTGCGAAAGTAGGAAAGACGGCTGCCAAACCAACGGGAGGTGCGACGGCAGCGAAATCGGCCCCTGCCCCGG
CTGCCAAACCTGCCCCTGCGAAACAGGCGACGGCGAAGCCTGCACAGCAGAAGCAGACGACGAAGCAGCCTCAGGCAGCG
ACCAAAGCCGCCGCTCCTGCTGCGAAGCAGGTGACGGCGCAGTCTAAAGCTGCACCTGCACCGAAGGCGCCTGCGGCAGC
CAGTAAACCCGCTGCTCAGACAAAGCCGGCTCCGGCTGCACAAAAACCCACGACAGCTCCGGCTGCTAAGAAACCAGCGG
CAGCCCCAGCACAGAGTGGCGCTTTGAAAAAAGCCGTTCAGCCTAAAAAGGCACTAGGTGCTTCGAAAAAACCACACCAA
GGTGTGAAGAAACAAACTCTTAAAGGGAAAGGACAAAAGAAAAAGAAGGTCTCTTTAAAGTTTGTTATCGATTGTACACA
TCCGTATGAAGATAAAATCATTGATGTTGCTAACTTTGAAAAGTATCTTCAGGAAAGAATAAAAGTGAATGGAAAAACCA
ACAATTTTGGAAACAACCTTCAGCTAGAGAGAAATAAGATGAAAATTATTGTAACATCAGATATTCACATGTCTAAAAGA
TATTTGAAGTATCTCACGAAAAAGTATCTGAAAAAAAATAACCTCCGAGATTGGCTTAGAGTTGTTGCAAGTTCTAAAGA
CACCTATGAACTTAGGTACTTCCAGTTCAACAGCCAAGAAGATGAAGATGATGAGGATAATGATTGAAATCATTGCTTTT
AAAATATGATATTTTGTAAATTCTTTGTAACCAAAAGTTTACAAAACAGTTGTAA
>XM_014383760.2 PREDICTED: Cimex lectularius 50S ribosomal protein L1 (LOC106660813), mRNA
TACAAAATATCTTAAATTCTTCAAAGGTCGTTCTACTGAGTACTTGGTCGCAACTCTGTATTGGGAGGTTTCCCTTTCAA
AGTACGATCATGGACTTATCCAGAACAGCTTTCACATTGCTAAGTAGGCCTTGGTCTATTTACCAGAGAGCAGTCCAGTT
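The same filtering can be sketched in Python without inserting separator lines: remember whether the most recent header contained "ribosom" and keep each line accordingly. The function name ribosomal_records is made up for this example, and the sequences in the toy input are placeholders rather than real data.

```python
def ribosomal_records(lines, keyword="ribosom"):
    """Yield the header and sequence lines of every record whose header contains keyword."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # A new record starts here; decide whether to keep it.
            keep = keyword in line
        if keep:
            yield line


# Toy input: headers taken from the outputs above, placeholder sequences.
toy_fasta = """>XM_014383747.2 PREDICTED: Cimex lectularius 60S ribosomal protein L22-like (LOC106660802), mRNA
ACGTACGT
TTGACC
>NM_001316700.2 Cimex lectularius apyrase (LOC106669828), mRNA
GGCCTTAA
>XM_014383760.2 PREDICTED: Cimex lectularius 50S ribosomal protein L1 (LOC106660813), mRNA
CCAAGGTT"""

kept = list(ribosomal_records(toy_fasta.splitlines()))
for line in kept:
    print(line)
```

With the real data, the lines would come from iterating over the open C_lec.fna file, and the kept lines would be written to RNA_Clec.txt.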