I need to be able to parse both CSV and TSV files. I can't rely on the users to know the difference, so I would like to avoid asking the user to select the type. Is there a simple way to detect which delimiter is in use?
One way would be to read in every line and count both tabs and commas and find out which is most consistently used in every line. Of course, the data could include commas or tabs, so that may be easier said than done.
Edit: Another fun aspect of this project is that I will also need to detect the schema of the file when I read it in because it could be one of many. This means that I won't know how many fields I have until I can parse it.
samiz
samizsamizCSV Kit is the best utility that I’ve found for working with CSV files. It’s a free set of tools for dealing with CSV files on Linux. Some of the cool things it can do are: CSV clean will validate and clean the file of common syntax errors. It isn’t magic, but can definitely help. CSV grep is incredibly useful.
78811 gold badge99 silver badges2121 bronze badges
13 Answers
You could show them the results in preview window - similar to the way Excel does it. It's pretty clear when the wrong delimiter is being used in that case. You could then allow them to select a range of delimiters and have the preview update in real time.
Then you could just make a simple guess as to the delimiter to start with (e.g. does a comma or a tab come first).
dommerdommer16.1k77 gold badges5858 silver badges113113 bronze badges
In Python, there is a Sniffer class in the csv module that can be used to guess a given file's delimiter and quote characters. Its strategy is (quoted from csv.py's docstrings):
[First, look] for text enclosed between two identical quotes(the probable quotechar) which are preceded and followedby the same character (the probable delimiter).For example:
The quote with the most wins, same with the delimiter.If there is no quotechar the delimiter can't be determinedthis way.
In that case, try the following:
The delimiter should occur the same number of times oneach row. However, due to malformed data, it may not. We don't wantan all or nothing approach, so we allow for small variations in thisnumber.
- build a table of the frequency ofeach character on every line.
- build a table of freqencies of thisfrequency (meta-frequency?), e.g. 'x occurred 5 times in 10 rows, 6times in 1000 rows, 7 times in 2rows'
- use the mode of the meta-frequencyto determine the expectedfrequency for that character
- find out how often the characteractually meets that goal
- the character that best meets itsgoal is the delimiter
For performance reasons, the data is evaluated in chunks, so it cantry and evaluate the smallest portion of the data possible, evaluatingadditional chunks as necessary.
I'm not going to quote the source code here - it's in the Lib directory of every Python installation.
Remember that CSV can also use semicolons instead of commas as delimiters (e. g. in German versions of Excel, CSVs are semicolon-delimited because commas are used as decimal separators in Germany...)
Tim PietzckerTim Pietzcker256k4545 gold badges396396 silver badges468468 bronze badges
Do you know how many fields should be present per line? If so, I'd read the first few lines of the file and check based on that.
In my experience, 'normal' data quite often contains commas but rarely contains tab characters. This would suggest that you should check for a consistent number of tabs in the first few lines, and go with that choice as a preferred guess. Of course, it depends on exactly what data you've got.
Ultimately, it would be quite possible to have a file which is completely valid for both formats - so you can't make it absolutely foolproof. It'll have to be a 'best effort' job.
Jon SkeetJon Skeet1114k708708 gold badges81188118 silver badges85328532 bronze badges
I ran into a similar need and thought I would share what I came up with. I haven't run a lot of data through it yet, so there are possible edge cases. Also, keep in mind the goal of this function isn't 100% certainty of the delimiter, but best guess to be presented to user.
VermisVermis1,94522 gold badges1212 silver badges1212 bronze badges
It's in PHP but this seems to be quite reliable:
What it does is the following:For every specified possible delimiter, it reads every line in the CSV and checks if the number of times each seperator occurs is constant. If not, the candidate seperator is removed and ultimately you should end up with one seperator.
MartinMartin
I'd imagine that your suggested solution would be the best way to go. In a well-formed CSV or TSV file, the number of commas or tabs respectively per line should be constant (no variation at all). Do a count of each for every line of the file, and check which one is constant for all lines. It would seem quite unlikely that the count of both delimeters for each line is identical, but in this inconceivably rare case, you could of course prompt the user.
If neither the number of tabs nor commas is constant, then display a message to the user telling them that the file is malformed but the program thinks it is a (whatever format has the lowest standard deviation of delimeters per line) file.
NoldorinNoldorin116k4646 gold badges237237 silver badges284284 bronze badges
Just read a few lines, count the number of commas and the number of tabs and compare them. If there's 20 commas and no tabs, it's in CSV. If there's 20 tabs and 2 commas (maybe in the data), it's in TSV.
Andrew EnsleyAndrew Ensley7,3501515 gold badges5353 silver badges7272 bronze badges
Humphrey BogartHumphrey Bogart4,2731111 gold badges4545 silver badges5858 bronze badges
Assuming that there are a fixed number of fields per line and that any commas or tabs within values are enclosed by quotes ('), you should be able to work it out on the frequency of each character in each line. If the fields aren't fixed, this is harder, and if quotes aren't used to enclose otherwise delimiting characters, it will be, I suspect, near impossible (and depending on the data, locale-specific).
Jeff YatesJeff Yates53.5k1717 gold badges132132 silver badges177177 bronze badges
In my experience, data rarely contains tabs, so a line of tab delimited fields would (generally) be fairly obvious.
Commas are more difficult, though - especially if you're reading data in non-US locales. Numerical data can contain huge numbers of commas if you're reading files generated out of country, since floating point numbers will often contain them.
In the end, the only safe thing, though, is usually to try, then present it to the user and allow them to adjust, especially if your data will contain commas and/or tabs.
Reed CopseyReed Copsey477k6060 gold badges10011001 silver badges12861286 bronze badges
I would assume that in normal text, tabs are very rare except as the first character(s) on a line -- think indented paragraphs or source code. I think if you find embedded tabs (i.e. ones that don't follow commas), you can assume that the tabs are being used as the delimiters and be correct most of the time. This is just a hunch, not verified with any research. I'd of course give the user the option to override the auto-calculated mode.
rmeadorrmeador19.6k1414 gold badges5151 silver badges9090 bronze badges
Assuming you have a standard set of columns you are going to expect...
I would use FileHelper (open source project on SourceForge).http://filehelpers.sourceforge.net/
Define two reader templates, one for comas, one for tabs.
If the first one fails, try the second.
Chris BrandsmaChris Brandsma10.2k55 gold badges4040 silver badges5656 bronze badges
You can check whether a line is using one delimiter or another like this:
Laurel4,8471010 gold badges2222 silver badges3939 bronze badges
JackJack
Not the answer you're looking for? Browse other questions tagged c#asp.netcsvtext-parsing or ask your own question.
I have a number of .csv files. Some of them are comma delimited, some are tab delimited (maybe they should be called .tsv ...)
The
csv
extension gets associated with Excel when Excel is installed. However, if I open one of these files with excel, everything gets dumped into the same column and the comma is not interpreted as a delimiter.I can instead of File -> Import..., select the file, and choose the precise way to interpret the contents (delimiters, data types, etc.) But most of the time I just want to look at the file through a clear table view. I do not want to process it with Excel.
Is there a way to get Excel to auto-interpret the delimiter and show the CSV file as a proper table as soon as it's opened? I need this so I can use Excel as a quick viewer for such files.
I suspect there must be a way, otherwise Excel wouldn't associate itself with CSV files.
SzabolcsSzabolcs1,17333 gold badges1515 silver badges3434 bronze badges
8 Answers
While opening CSV files, Excel will use a system regional setting called
List separator
to determine which default delimiter to use.Microsoft Excel will open .csv files, but depending on the system's regional settings, it may expect a semicolon as a separator instead of a comma, since in some languages the comma is used as the decimal separator. (from Wikipedia)
On Windows, you can change the
List separator
setting in the Regional and Language Options
as specified on the Office support website :Change the separator in a CSV text file
- Click the Windows Start menu.
- Click Control Panel.
- Open the Regional and Language Options dialog box.
- Click the Regional Options Tab.
- Click Customize / Additional settings (Win10).
- Type a new separator in the List separator box.
- Click OK twice.
Note: this only works if the Decimal symbol is not also designated as comma (in line with the Wikipedia citation above). If it is, Excel will not use comma as the List separator, even if chosen. For many non-United States regions, comma is the default Decimal symbol.
On Mac OS X, this setting seems to be deduced from the decimal separator setting (in the Language & Region pane of System Preferences, go to Advanced). If the Decimal Separator is a point then the default CSV separator will be a comma, but if the Decimal Separator is a comma, then the default CSV separator will be a semicolon.
As you said yourself in the comment, there is an alternative for Mac users to quickly look at those CSV files. It's plugin for Quick Look called quicklook-csv that handles separator detection.
yosh m1,83333 gold badges1515 silver badges2424 bronze badges
zakinsterzakinster
If you are not looking to modify the format of the file, and are ONLY targeting Excel, you can use the following Excel trick to help you.
Add a new line at the top of the file with the text
'sep=,'
(including quotes) in order for Excel to open the file with ',' as the list separator.It´s a very easy trick to avoid changing your Windows regional settings and get a consistent result. But it is Excel specific.
user280725user280725
You don't need the quotes around the
sep=,
- as long as it's the first line of the file it'll work, at least with Excel 2016.I discovered that if the file is tab delimited,
Vylixsep=t
works fine, with and without the quotes.1,56522 gold badges1616 silver badges2222 bronze badges
Stephen HemingwayStephen Hemingway
Accepted answer is correct but I am a visual person. Here is every single step in screenshot format of how to do this in windows 10.
Sam BSam B
When the separator in the regional settings is not a comma but a semicolon (Dutch separator), rename the CSV file to a TXT file. Right-click the TXT file and select 'Open with' and select 'Excel'. In Excel select the first column, select data in the ribbon and separate text to columns.
OR
Install LibreOffice and open the CSV file with LibreOffice Calc.
robinCTS4,03144 gold badges1515 silver badges2727 bronze badges
Eric VDBEric VDB
Be sure to inspect the CSV file in a simple editor like Notepad to verify it is properly formatted.
I added this answer after I solved a stupid bug wherein CSV files I created with VB weren't opened with separate columns in Excel. I discovered that the way I had written the lines wrapped each line with quotation marks. Excel hid the quotation marks and showed the whole line in column A, making it appear as though it ignored my comma separators.
edjedj
best way will be to save it in a text file with csv extension
AzizDAzizD
For Dutch I had a problem that Excel 2008 on Windows 7 did not adhere to RFC4180:
'Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.'
A comma-separated file correctly had fields with a comma (Dutch decimal separator) enclosed in double quotes. In the Dutch locale the decimal separator is
,
and list separator is ;
so Excel could not read a file with ,
used for both (not even after explicitly changing the list separator to ,
in Control Panel).The input looks like:
The solution was given by @user280725:
Use Notepad to insert as a first line:
(This means the Note of user @zakinster in his solution no longer needs to apply.)
Interesting: with the csv file still loaded, if you now set the locale to US English in Control Panel and save the file, it will have been converted to US English format (comma as list separator and dot as decimal separator).Paul OgilviePaul Ogilvie