When I only have a few sequences, I usually extract them with `grep -A 1 -f seq.lst seq.fas | sed '/^--$/d' > out.fas`. This time, however, I ran into a large file and grep was far too slow; I then tried TBtools' sequence-extraction feature, which also took a long time, so I wrote a Python script instead. Extracting nearly one million reads takes only about 10 s.
```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Usage: python script.py lstFile seqFile outfile
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("lstFile", help="input a list file", type=str)
parser.add_argument("seqFile", help="input a sequence file", type=str)
parser.add_argument("outfile", help="output a file", type=str)
args = parser.parse_args()

# Read the IDs to extract, one per line.
query = []
with open(args.lstFile) as frLst:
    for line in frLst:
        query.append(line.strip())

# Index the FASTA file: header (without '>') -> list of sequence lines.
database = {}
with open(args.seqFile) as frSeq:
    key = None
    for line in frSeq:
        if line.startswith('>'):
            key = line.lstrip('>').strip()
            database[key] = []
        elif key is not None:
            database[key].append(line.strip())

# Write the requested records, joining wrapped sequence lines.
with open(args.outfile, 'w') as fwOut:
    for name in query:
        fwOut.write('>' + name + '\n' + ''.join(database[name]) + '\n')
```
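The script above loads the entire FASTA file into a dictionary, which is fast but memory-hungry. If the FASTA is too large to hold in memory, a single streaming pass that keeps only the requested IDs works too (a hypothetical `extract_seqs` helper sketched here as an alternative, not part of the script above; it gives up the ability to reorder output by list order):

```python
def extract_seqs(id_set, fasta_lines):
    """Stream FASTA lines, yielding (id, sequence) only for wanted IDs."""
    name, parts = None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith('>'):
            if name in id_set:          # emit the previous record if wanted
                yield name, ''.join(parts)
            name, parts = line.lstrip('>'), []
        else:
            parts.append(line)
    if name in id_set:                   # flush the last record
        yield name, ''.join(parts)

# Example with an in-memory FASTA; an open file object works the same way:
fasta = ['>r1', 'ACGT', 'TTAA', '>r2', 'GGGG', '>r3', 'CCTA']
print(dict(extract_seqs({'r1', 'r3'}, fasta)))
# -> {'r1': 'ACGTTTAA', 'r3': 'CCTA'}
```

Because membership tests are against a `set`, lookup stays O(1) per record, and memory use is bounded by one record plus the ID list rather than the whole file.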