Bad Csv File Verify That Your Delimiters Are Consistent

I need to be able to parse both CSV and TSV files. I can't rely on the users to know the difference, so I would like to avoid asking the user to select the type. Is there a simple way to detect which delimiter is in use?

One way would be to read in every line and count both tabs and commas and find out which is most consistently used in every line. Of course, the data could include commas or tabs, so that may be easier said than done.

Edit: Another fun aspect of this project is that I will also need to detect the schema of the file when I read it in because it could be one of many. This means that I won't know how many fields I have until I can parse it.

samiz

samizsamiz

CSV Kit is the best utility that I’ve found for working with CSV files. It’s a free set of tools for dealing with CSV files on Linux. Some of the cool things it can do are: CSV clean will validate and clean the file of common syntax errors. It isn’t magic, but can definitely help. CSV grep is incredibly useful.

7881 gold badge9 silver badges21 bronze badges

13 Answers

You could show them the results in preview window - similar to the way Excel does it. It's pretty clear when the wrong delimiter is being used in that case. You could then allow them to select a range of delimiters and have the preview update in real time.

Then you could just make a simple guess as to the delimiter to start with (e.g. does a comma or a tab come first).

dommerdommer

16.1k7 gold badges58 silver badges113 bronze badges

In Python, there is a Sniffer class in the csv module that can be used to guess a given file's delimiter and quote characters. Its strategy is (quoted from csv.py's docstrings):

[First, look] for text enclosed between two identical quotes(the probable quotechar) which are preceded and followedby the same character (the probable delimiter).For example:

The quote with the most wins, same with the delimiter.If there is no quotechar the delimiter can't be determinedthis way.

In that case, try the following:

The delimiter should occur the same number of times oneach row. However, due to malformed data, it may not. We don't wantan all or nothing approach, so we allow for small variations in thisnumber.

build a table of the frequency ofeach character on every line.
build a table of freqencies of thisfrequency (meta-frequency?), e.g. 'x occurred 5 times in 10 rows, 6times in 1000 rows, 7 times in 2rows'
use the mode of the meta-frequencyto determine the expectedfrequency for that character
find out how often the characteractually meets that goal
the character that best meets itsgoal is the delimiter

For performance reasons, the data is evaluated in chunks, so it cantry and evaluate the smallest portion of the data possible, evaluatingadditional chunks as necessary.

I'm not going to quote the source code here - it's in the Lib directory of every Python installation.

Bad csv file verify that your delimiters are consistent meaning

Remember that CSV can also use semicolons instead of commas as delimiters (e. g. in German versions of Excel, CSVs are semicolon-delimited because commas are used as decimal separators in Germany...)

Tim PietzckerTim Pietzcker

256k45 gold badges396 silver badges468 bronze badges

Do you know how many fields should be present per line? If so, I'd read the first few lines of the file and check based on that.

In my experience, 'normal' data quite often contains commas but rarely contains tab characters. This would suggest that you should check for a consistent number of tabs in the first few lines, and go with that choice as a preferred guess. Of course, it depends on exactly what data you've got.

Ultimately, it would be quite possible to have a file which is completely valid for both formats - so you can't make it absolutely foolproof. It'll have to be a 'best effort' job.

Jon SkeetJon Skeet

1114k708 gold badges8118 silver badges8532 bronze badges

I ran into a similar need and thought I would share what I came up with. I haven't run a lot of data through it yet, so there are possible edge cases. Also, keep in mind the goal of this function isn't 100% certainty of the delimiter, but best guess to be presented to user.

VermisVermis

1,9452 gold badges12 silver badges12 bronze badges

It's in PHP but this seems to be quite reliable:

What it does is the following:For every specified possible delimiter, it reads every line in the CSV and checks if the number of times each seperator occurs is constant. If not, the candidate seperator is removed and ultimately you should end up with one seperator.

MartinMartin

I'd imagine that your suggested solution would be the best way to go. In a well-formed CSV or TSV file, the number of commas or tabs respectively per line should be constant (no variation at all). Do a count of each for every line of the file, and check which one is constant for all lines. It would seem quite unlikely that the count of both delimeters for each line is identical, but in this inconceivably rare case, you could of course prompt the user.

If neither the number of tabs nor commas is constant, then display a message to the user telling them that the file is malformed but the program thinks it is a (whatever format has the lowest standard deviation of delimeters per line) file.

NoldorinNoldorin

116k46 gold badges237 silver badges284 bronze badges

Just read a few lines, count the number of commas and the number of tabs and compare them. If there's 20 commas and no tabs, it's in CSV. If there's 20 tabs and 2 commas (maybe in the data), it's in TSV.

Andrew EnsleyAndrew Ensley

7,35015 gold badges53 silver badges72 bronze badges

Humphrey BogartHumphrey Bogart

4,27311 gold badges45 silver badges58 bronze badges

Assuming that there are a fixed number of fields per line and that any commas or tabs within values are enclosed by quotes ('), you should be able to work it out on the frequency of each character in each line. If the fields aren't fixed, this is harder, and if quotes aren't used to enclose otherwise delimiting characters, it will be, I suspect, near impossible (and depending on the data, locale-specific).

Jeff YatesJeff Yates

53.5k17 gold badges132 silver badges177 bronze badges

In my experience, data rarely contains tabs, so a line of tab delimited fields would (generally) be fairly obvious.

Commas are more difficult, though - especially if you're reading data in non-US locales. Numerical data can contain huge numbers of commas if you're reading files generated out of country, since floating point numbers will often contain them.

In the end, the only safe thing, though, is usually to try, then present it to the user and allow them to adjust, especially if your data will contain commas and/or tabs.

Reed CopseyReed Copsey

477k60 gold badges1001 silver badges1286 bronze badges

I would assume that in normal text, tabs are very rare except as the first character(s) on a line -- think indented paragraphs or source code. I think if you find embedded tabs (i.e. ones that don't follow commas), you can assume that the tabs are being used as the delimiters and be correct most of the time. This is just a hunch, not verified with any research. I'd of course give the user the option to override the auto-calculated mode.

rmeadorrmeador

19.6k14 gold badges51 silver badges90 bronze badges

Assuming you have a standard set of columns you are going to expect...

I would use FileHelper (open source project on SourceForge).http://filehelpers.sourceforge.net/

Define two reader templates, one for comas, one for tabs.

If the first one fails, try the second.

Chris BrandsmaChris Brandsma

10.2k5 gold badges40 silver badges56 bronze badges

You can check whether a line is using one delimiter or another like this:

Laurel

4,84710 gold badges22 silver badges39 bronze badges

JackJack

Not the answer you're looking for? Browse other questions tagged c#asp.netcsvtext-parsing or ask your own question.

I have a number of .csv files. Some of them are comma delimited, some are tab delimited (maybe they should be called .tsv ...)

The csv extension gets associated with Excel when Excel is installed. However, if I open one of these files with excel, everything gets dumped into the same column and the comma is not interpreted as a delimiter.

I can instead of File -> Import..., select the file, and choose the precise way to interpret the contents (delimiters, data types, etc.) But most of the time I just want to look at the file through a clear table view. I do not want to process it with Excel.

Is there a way to get Excel to auto-interpret the delimiter and show the CSV file as a proper table as soon as it's opened? I need this so I can use Excel as a quick viewer for such files.

I suspect there must be a way, otherwise Excel wouldn't associate itself with CSV files.

SzabolcsSzabolcs

1,1733 gold badges15 silver badges34 bronze badges

8 Answers

While opening CSV files, Excel will use a system regional setting called List separator to determine which default delimiter to use.

Microsoft Excel will open .csv files, but depending on the system's regional settings, it may expect a semicolon as a separator instead of a comma, since in some languages the comma is used as the decimal separator. (from Wikipedia)

On Windows, you can change the List separator setting in the Regional and Language Options as specified on the Office support website :

Change the separator in a CSV text file

Click the Windows Start menu.
Click Control Panel.
Open the Regional and Language Options dialog box.
Click the Regional Options Tab.
Click Customize / Additional settings (Win10).
Type a new separator in the List separator box.
Click OK twice.

Note: this only works if the Decimal symbol is not also designated as comma (in line with the Wikipedia citation above). If it is, Excel will not use comma as the List separator, even if chosen. For many non-United States regions, comma is the default Decimal symbol.

On Mac OS X, this setting seems to be deduced from the decimal separator setting (in the Language & Region pane of System Preferences, go to Advanced). If the Decimal Separator is a point then the default CSV separator will be a comma, but if the Decimal Separator is a comma, then the default CSV separator will be a semicolon.

As you said yourself in the comment, there is an alternative for Mac users to quickly look at those CSV files. It's plugin for Quick Look called quicklook-csv that handles separator detection.

yosh m

1,8333 gold badges15 silver badges24 bronze badges

zakinsterzakinster

If you are not looking to modify the format of the file, and are ONLY targeting Excel, you can use the following Excel trick to help you.

Add a new line at the top of the file with the text 'sep=,' (including quotes) in order for Excel to open the file with ',' as the list separator.

It´s a very easy trick to avoid changing your Windows regional settings and get a consistent result. But it is Excel specific.

user280725user280725

You don't need the quotes around the sep=, - as long as it's the first line of the file it'll work, at least with Excel 2016.

I discovered that if the file is tab delimited, sep=t works fine, with and without the quotes.

Vylix

1,5652 gold badges16 silver badges22 bronze badges

Stephen HemingwayStephen Hemingway

Accepted answer is correct but I am a visual person. Here is every single step in screenshot format of how to do this in windows 10.

Sam BSam B

When the separator in the regional settings is not a comma but a semicolon (Dutch separator), rename the CSV file to a TXT file. Right-click the TXT file and select 'Open with' and select 'Excel'. In Excel select the first column, select data in the ribbon and separate text to columns.

Install LibreOffice and open the CSV file with LibreOffice Calc.

robinCTS

4,0314 gold badges15 silver badges27 bronze badges

Eric VDBEric VDB

Be sure to inspect the CSV file in a simple editor like Notepad to verify it is properly formatted.

I added this answer after I solved a stupid bug wherein CSV files I created with VB weren't opened with separate columns in Excel. I discovered that the way I had written the lines wrapped each line with quotation marks. Excel hid the quotation marks and showed the whole line in column A, making it appear as though it ignored my comma separators.

edjedj

best way will be to save it in a text file with csv extension

AzizDAzizD

For Dutch I had a problem that Excel 2008 on Windows 7 did not adhere to RFC4180:

'Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.'

A comma-separated file correctly had fields with a comma (Dutch decimal separator) enclosed in double quotes. In the Dutch locale the decimal separator is , and list separator is ; so Excel could not read a file with , used for both (not even after explicitly changing the list separator to , in Control Panel).

The input looks like:

The solution was given by @user280725:

Use Notepad to insert as a first line:

(This means the Note of user @zakinster in his solution no longer needs to apply.)

Interesting: with the csv file still loaded, if you now set the locale to US English in Control Panel and save the file, it will have been converted to US English format (comma as list separator and dot as decimal separator).

Paul OgilviePaul Ogilvie

13 Answers

Not the answer you're looking for? Browse other questions tagged c#asp.netcsvtext-parsing or ask your own question.

8 Answers

Not the answer you're looking for? Browse other questions tagged microsoft-excelcsv or ask your own question.