At the moment, the csv module does not support UTF-16.
In Python 3.x, csv expects a text-mode file and you can simply use the encoding parameter of open to force another encoding:
    # Python 3.x only
    import csv
    with open('utf16.csv', 'r', encoding='utf16') as csvf:
        for line in csv.reader(csvf):
            print(line)  # do something with the line
In Python 2.x, you can recode the input:
    # Python 2.x only
    import codecs
    import csv

    class Recoder(object):
        def __init__(self, stream, decoder, encoder, eol='\r\n'):
            self._stream = stream
            self._decoder = decoder if isinstance(decoder, codecs.IncrementalDecoder) else codecs.getincrementaldecoder(decoder)()
            self._encoder = encoder if isinstance(encoder, codecs.IncrementalEncoder) else codecs.getincrementalencoder(encoder)()
            self._buf = ''
            self._eol = eol
            self._reachedEof = False

        def read(self, size=None):
            r = self._stream.read(size)
            raw = self._decoder.decode(r, size is None)
            return self._encoder.encode(raw)

        def __iter__(self):
            return self

        def __next__(self):
            if self._reachedEof:
                raise StopIteration()
            while True:
                line,eol,rest = self._buf.partition(self._eol)
                if eol == self._eol:
                    self._buf = rest
                    return self._encoder.encode(line + eol)
                raw = self._stream.read(1024)
                if raw == '':
                    self._decoder.decode(b'', True)
                    self._reachedEof = True
                    return self._encoder.encode(self._buf)
                self._buf += self._decoder.decode(raw)
        next = __next__

        def close(self):
            return self._stream.close()

    with open('test.csv','rb') as f:
        sr = Recoder(f, 'utf-16', 'utf-8')

        for row in csv.reader(sr):
            print (row)
open and codecs.open require the file to start with a BOM. If it doesn't (or you're on Python 2.x), you can still convert it in memory, like this:
    try:
        from io import BytesIO
    except ImportError:  # Python < 2.6
        from StringIO import StringIO as BytesIO
    import csv
    with open('utf16.csv', 'rb') as binf:
        c = binf.read().decode('utf-16').encode('utf-8')
    for line in csv.reader(BytesIO(c)):
        print(line)  # do something with the line
The Python 2.x csv module documentation examples show how to handle other encodings.
I would strongly suggest that you recode your file(s) to UTF-8. UTF-16 is a variable-length encoding in general, but under the very likely condition that you don't have any Unicode characters outside the BMP, every character occupies exactly two bytes. You can take advantage of that to read blocks of an even number of bytes from your input file without worrying about a character straddling a block boundary.
Step 1: Determine what encoding you actually have. Examine the first few bytes of your file:
print repr(open('thefile.csv', 'rb').read(100))
Four possible ways of encoding u'abc':

    \xfe\xff\x00a\x00b\x00c -> utf_16
    \xff\xfea\x00b\x00c\x00 -> utf_16
    \x00a\x00b\x00c         -> utf_16_be
    a\x00b\x00c\x00         -> utf_16_le
If you have any trouble with this step, edit your question to include the results of the above print repr().
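The BOM check in Step 1 can also be done in code. Here is a minimal sketch, written for Python 3; the function name `sniff_utf16` is made up for illustration and is not part of any library. It only looks at the first two bytes, so for a file without a BOM it returns None and you still have to inspect the repr output by eye to tell utf_16_be from utf_16_le.

```python
import codecs

def sniff_utf16(first_bytes):
    """Return the explicit-endian codec name if first_bytes starts with a
    UTF-16 BOM, otherwise None (hypothetical helper, not a library call)."""
    if first_bytes.startswith(codecs.BOM_UTF16_BE):   # b'\xfe\xff'
        return 'utf_16_be'
    if first_bytes.startswith(codecs.BOM_UTF16_LE):   # b'\xff\xfe'
        return 'utf_16_le'
    return None  # no BOM: endianness cannot be decided from the BOM alone
```

Typical use would be something like `sniff_utf16(open('thefile.csv', 'rb').read(2))`, with the result passed as the encoding to whatever does the actual decoding.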
Step 2: Here's a Python 2.X recode-UTF-16*-to-UTF-8 script:
    import sys
    infname, outfname, enc = sys.argv[1:4]
    fi = open(infname, 'rb')
    fo = open(outfname, 'wb')
    BUFSIZ = 64 * 1024 * 1024
    first = True
    while 1:
        buf = fi.read(BUFSIZ)
        if not buf: break
        if first and enc == 'utf_16':
            bom = buf[:2]
            buf = buf[2:]
            enc = {'\xfe\xff': 'utf_16_be', '\xff\xfe': 'utf_16_le'}[bom]
            # KeyError means file doesn't start with a valid BOM
        first = False
        fo.write(buf.decode(enc).encode('utf8'))
    fi.close()
    fo.close()
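If you are on Python 3, roughly the same recoding can be sketched with an incremental decoder from the `codecs` module. This is a different technique from the script above, not a port of it: the decoder keeps any incomplete character between calls, so the block size does not have to be even and characters outside the BMP are handled too. The function name `recode_utf16_to_utf8` and its parameters are my own choices for illustration.

```python
import codecs

def recode_utf16_to_utf8(src, dst, enc='utf_16', bufsize=64 * 1024):
    """Copy binary stream src to binary stream dst, recoding enc -> UTF-8.

    Hypothetical helper: src and dst are file objects opened in 'rb' / 'wb' mode.
    """
    decoder = codecs.getincrementaldecoder(enc)()
    while True:
        buf = src.read(bufsize)
        if not buf:
            break
        # Bytes of a character split across two blocks are buffered
        # inside the decoder until the rest arrives.
        dst.write(decoder.decode(buf).encode('utf8'))
    # final=True raises UnicodeDecodeError if the input ended mid-character.
    dst.write(decoder.decode(b'', final=True).encode('utf8'))
```

With `enc='utf_16'` the decoder reads the BOM itself and drops it from the output; for a file without a BOM you would pass 'utf_16_le' or 'utf_16_be' explicitly.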
Other matters:
You say that your files are too big to read the whole file, recode and rewrite, yet you can open it in vi. Please explain.
The <85> being treated as end of record is a bit of a worry. Looks like 0x85 is being recognised as NEL (C1 control code, NEWLINE). There is a strong possibility that the data was originally encoded in some legacy single-byte encoding where 0x85 has a meaning but has been transcoded to UTF-16 under the false assumption that the original encoding was ISO-8859-1 aka latin1. Where did the file originate? An IBM mainframe? Windows/Unix/classic Mac? What country, locale, language? You obviously think that the <85> is not meant to be a newline; what do you think that it means?
Please feel free to send a copy of a cut-down file (that includes some of the <85> stuff) to sjmachin at lexicon dot net
Update based on 1-line sample data provided.
This confirms my suspicions. Read this. Here's a quote from it:
... the C1 control characters ... are rarely used directly, except on specific platforms such as OpenVMS. When they turn up in documents, Web pages, e-mail messages, etc., which are ostensibly in an ISO-8859-n encoding, their code positions generally refer instead to the characters at that position in a proprietary, system-specific encoding such as Windows-1252 or the Apple Macintosh ("MacRoman") character set that use the codes provided for representation of the C1 set with a single 8-bit byte to instead provide additional graphic characters
This code:
    s1 = '\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'
    s2 = s1.decode('utf16')
    print 's2 repr:', repr(s2)
    from unicodedata import name
    from collections import Counter
    non_ascii = Counter(c for c in s2 if c >= u'\x80')
    print 'non_ascii:', non_ascii
    for c in non_ascii:
        print "from: U+%04X %s" % (ord(c), name(c, "<no name>"))
        c2 = c.encode('latin1').decode('cp1252')
        print "to: U+%04X %s" % (ord(c2), name(c2, "<no name>"))

    s3 = u''.join(
        c.encode('latin1').decode('1252') if u'\x80' <= c < u'\xA0' else c
        for c in s2
        )
    print 's3 repr:', repr(s3)
    print 's3:', s3
produces the following (Python 2.7.2 IDLE, Windows 7):
    s2 repr: u'1,2,G,S,H f\xfcr e \x96 m \x85,,I\r\n'
    non_ascii: Counter({u'\x85': 1, u'\xfc': 1, u'\x96': 1})
    from: U+0085 <no name>
    to: U+2026 HORIZONTAL ELLIPSIS
    from: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
    to: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
    from: U+0096 <no name>
    to: U+2013 EN DASH
    s3 repr: u'1,2,G,S,H f\xfcr e \u2013 m \u2026,,I\r\n'
    s3: 1,2,G,S,H für e – m …,,I
Which do you think is a more reasonable interpretation of \x96: SPA, i.e. Start of Protected Area (used by block-oriented terminals), or EN DASH?
Looks like a thorough analysis of a much larger data sample is warranted. Happy to help.
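For a larger sample, the latin1-to-cp1252 repair shown above could be wrapped up along these lines. This is a Python 3 sketch under the assumption that every C1 character in your data really is a mis-transcoded Windows-1252 character, which only an analysis of the real data can confirm; the function name `fix_c1_as_cp1252` is hypothetical. A handful of byte values (0x81, for instance) have no assignment in Python's cp1252 codec, so those are left untouched rather than raising.

```python
def fix_c1_as_cp1252(text):
    """Reinterpret C1 control characters (U+0080..U+009F) as the
    Windows-1252 characters at the same byte positions."""
    out = []
    for c in text:
        if '\x80' <= c < '\xa0':
            try:
                c = c.encode('latin1').decode('cp1252')
            except UnicodeDecodeError:
                pass  # byte is unassigned in cp1252; keep the original character
        out.append(c)
    return ''.join(out)
```

Characters at U+00A0 and above, such as the ü in the sample line, pass through unchanged.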
Just open your file with codecs.open, like this:
    import codecs, csv
    stream = codecs.open(<yourfile.csv>, encoding="utf-16")
    reader = csv.reader(stream)
Then work through your program with unicode strings, as you should do anyway if you are processing text.