I’m trying to modify a given text file, wherein I want to change/alter the following strings, eg:
lcl|NC_018257.1_cds_XP_003862892.1_5067
lcl|NC_018241.1_cds_XP_003859498.1_1683
lcl|NC_018256.1_cds_XP_003862456.1_4633
lcl|NC_018237.1_cds_XP_003858978.1_1163
lcl|NC_018254.1_cds_XP_003861926.1_4104
so that it only contains the
XP_n.1 part of the string.
I have successfully removed the
lcl|NC\_*.1_cds\_ part out of the strings for which
I used the following
sed 's/lcl|NC\_.*_cds_//g' cds.fa > cds4.fa
The resulting text file contains strings like XP_003862892.1_5067.
There are about 8014 strings like this, ranging from
XP_*.1_1 to XP_*.1_8014. I want to delete the
_8014 part (the trailing underscore and number) so that each string ends in .1.
I tried using
and it seemed to have worked. However, when I scrolled further down the list of strings, the double-digit numbers were not fully replaced: only the digit immediately following the ‘_’ was replaced, so the first digit turned into 1 and the rest stayed as they were. The same happened with triple- and quadruple-digit numbers.
XP_003857837.1_23 ---> XP_003857837.13
XP_003857942.1_228 ---> XP_003857942.128
I have absolutely no idea how to fix this; all my attempts have failed. Some people have asked what my desired output should look like. The ideal output would be XP_003857837.1: each string should end in .1 instead of .1_SomeNumberRangingFrom1to8014.
You can do everything in one go with a slightly more complex regex.
sed 's/lcl|NC_.*_cds_\(XP_[0-9.]*\)_.*/\1/' cds.fa > cds4.fa
The backslashed parentheses create a capturing group, and
\1 in the replacement recalls the first captured group (
\2 for the second, etc, if you have more than one). The regex inside the group looks for
XP_ followed by digits and dots, and the expression after matches the rest of the line from the next underscore on.
In other words, this basically says "replace the whole line with just the part we care about".
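To see the capturing group at work, here is a minimal sketch that runs the same expression on three identifiers copied from the question. The function name extract_xp is hypothetical (it is only a wrapper for illustration), and the sketch assumes each identifier sits on its own line.

```shell
# Hypothetical helper: wraps the sed expression from the answer.
# Assumes one identifier per input line.
extract_xp() {
  sed 's/lcl|NC_.*_cds_\(XP_[0-9.]*\)_.*/\1/'
}

# Sample identifiers taken from the question.
printf '%s\n' \
  'lcl|NC_018257.1_cds_XP_003862892.1_5067' \
  'lcl|NC_018241.1_cds_XP_003859498.1_1683' \
  'lcl|NC_018237.1_cds_XP_003858978.1_1163' |
  extract_xp
```

On these samples the pipeline should print XP_003862892.1, XP_003859498.1 and XP_003858978.1, one per line. Lines that do not match the pattern pass through sed unchanged, so other content in the file is left alone.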
By the by, there is no reason to backslash underscores anywhere, and the
/g option to the
s command only makes sense when you want to replace multiple occurrences on the same input line.
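The effect of the g flag is easiest to see on a toy input with more than one match per line. The following sketch uses a made-up string, a_b_c, and made-up variable names purely for illustration.

```shell
# Without g: only the first underscore on the line is replaced.
first_only=$(printf '%s\n' 'a_b_c' | sed 's/_/-/')

# With g: every underscore on the line is replaced.
all_matches=$(printf '%s\n' 'a_b_c' | sed 's/_/-/g')

printf '%s\n' "$first_only"    # prints a-b_c
printf '%s\n' "$all_matches"   # prints a-b-c
```

Since the one-step command above matches the whole line at once, there is only ever one match per line, so adding g there would change nothing.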
Answered By – tripleee