CSV files with UTF8 BOM



Andrew Burton
Hi list,

Is there any appetite for handling UTF-8 with BOM markers automatically
when loading CSV input files? These currently fail silently since the first
character in the file is the BOM marker, which means CSV files with headers
don't create the correct variable name.
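
To make the failure concrete, here is a minimal Java sketch (not JMeter's actual code) of what happens when a BOM-prefixed file is decoded as UTF-8 through an InputStreamReader. The class name, method name, and sample columns are made up for illustration. The decoder keeps the BOM as the character U+FEFF, so the first header field no longer equals the expected name:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class BomHeaderDemo {
    // Decodes the first line as UTF-8 and returns its first comma-separated field.
    // Hypothetical helper, standing in for "read the header and use it as a variable name".
    static String firstHeader(byte[] csvBytes) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(csvBytes), StandardCharsets.UTF_8))) {
            String line = reader.readLine();
            return line == null ? "" : line.split(",")[0];
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // UTF-8 BOM (EF BB BF) followed by an invented header and data row.
        byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        byte[] body = "user,password\nalice,secret\n".getBytes(StandardCharsets.UTF_8);
        byte[] withBom = new byte[bom.length + body.length];
        System.arraycopy(bom, 0, withBom, 0, bom.length);
        System.arraycopy(body, 0, withBom, bom.length, body.length);

        String header = firstHeader(withBom);
        System.out.println(header.equals("user"));        // false
        System.out.println(header.length());              // 5, not 4
        System.out.println(header.charAt(0) == '\uFEFF'); // true
    }
}
```

With that header line the first variable would effectively be named U+FEFF followed by "user", which looks identical to "user" in most editors.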

I *know* that the Unicode standard neither requires nor recommends a BOM for UTF-8,
but it is commonplace when exporting from MS SQL Server (which for a lot of
Windows-based users might be their way of generating data).

I know we can convert the encoding from UTF8 BOM to UTF8 using, e.g.,
Notepad++ or dos2unix, but this adds an extra step to fix a problem that a
lot of users would struggle to identify in the first place ("My data file
is not working, and it looks fine when I open it in Notepad!").

(SQL Server does provide an option to output Unicode but this is UTF16, not
UTF8, which is a whole other story).

I'd propose an additional step of identifying the file's encoding (using the
getEncoding() method in InputStreamReader) and, if it is UTF8, checking if it
has a BOM marker and, if so, handling it with the BOMInputStream class in
Apache commons-io (ref
https://commons.apache.org/proper/commons-io/javadocs/api-release/org/apache/commons/io/input/BOMInputStream.html
).
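
As a rough sketch of the check described above, using only the JDK so that it stands alone: this is a hand-rolled stand-in for the kind of work BOMInputStream does, not the commons-io class itself, and Utf8BomSkipper, skipUtf8Bom and firstLine are hypothetical names. It peeks at the first three bytes and pushes them back unless they are the UTF-8 BOM:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8BomSkipper {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Returns a stream positioned after a leading UTF-8 BOM, if one is present.
    // Any other leading bytes are pushed back so the caller sees them unchanged.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int read = 0;
        while (read < head.length) {
            int n = pushback.read(head, read, head.length - read);
            if (n < 0) {
                break; // input is shorter than a BOM
            }
            read += n;
        }
        boolean hasBom = read == UTF8_BOM.length && Arrays.equals(head, UTF8_BOM);
        if (!hasBom && read > 0) {
            pushback.unread(head, 0, read);
        }
        return pushback;
    }

    // Convenience wrapper for the sketch: first line of the data, decoded as UTF-8.
    static String firstLine(byte[] csvBytes) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(csvBytes)), StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Files without a BOM pass through untouched, so the check could run unconditionally whenever the configured encoding is UTF-8.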

One other thing that might be useful is changing the input field of the
CSVDataSet for encoding to be a drop down list with only the charset values
supported by InputStreamReader (ref
https://docs.oracle.com/javase/8/docs/api/?java/io/InputStreamReader.html
and https://docs.oracle.com/javase/8/docs/api/java/nio/charset/Charset.html).
The documentation doesn't list which encodings are valid (I had to dig
through the code to find the relevant handling class) and there's always
the risk of a typo.
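
A sketch of how the valid values could be listed or checked with the standard java.nio.charset.Charset API. EncodingChoices, dropDownValues and isUsable are hypothetical names, and how this would be wired into the CSVDataSet GUI is left open:

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.ArrayList;
import java.util.List;

class EncodingChoices {
    // Canonical names of every charset this JVM supports, in sorted order.
    // These could populate a drop-down list.
    static List<String> dropDownValues() {
        return new ArrayList<>(Charset.availableCharsets().keySet());
    }

    // True if the name (or one of its aliases) resolves to a supported charset.
    // Syntactically illegal names are treated as unusable instead of throwing.
    static boolean isUsable(String name) {
        if (name == null) {
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false;
        }
    }
}
```

Even without a drop-down, a check like isUsable("UFT-8") would catch the typo case up front.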

I'm happy to spend some time on this if it is something that the core devs
would find useful.

Regards

Andrew

Re: CSV files with UTF8 BOM

Felix Schumacher


On 20.07.2018 at 01:47, Andrew Burton wrote:

> Hi list,
>
> Is there any appetite for handling UTF-8 with BOM markers automatically
> when loading CSV input files? These currently fail silently since the first
> character in the file is the BOM marker, which means CSV files with headers
> don't create the correct variable name.
>
> I *know* that the Unicode standard neither requires nor recommends a BOM for UTF-8,
> but it is commonplace when exporting from MS SQL Server (which for a lot of
> Windows-based users might be their way of generating data).
>
> I know we can convert the encoding from UTF8 BOM to UTF8 using, e.g.,
> Notepad++ or dos2unix, but this adds an extra step to fix a problem that a
> lot of users would struggle to identify in the first place ("My data file
> is not working, and it looks fine when I open it in Notepad!").
>
> (SQL Server does provide an option to output Unicode but this is UTF16, not
> UTF8, which is a whole other story).
>
> I'd propose an additional step of identifying the file's encoding (using the
> getEncoding() method in InputStreamReader) and, if it is UTF8, checking if it
> has a BOM marker and, if so, handling it with the BOMInputStream class in
> Apache commons-io (ref
> https://commons.apache.org/proper/commons-io/javadocs/api-release/org/apache/commons/io/input/BOMInputStream.html
> ).
>
> One other thing that might be useful is changing the input field of the
> CSVDataSet for encoding to be a drop down list with only the charset values
> supported by InputStreamReader (ref
> https://docs.oracle.com/javase/8/docs/api/?java/io/InputStreamReader.html
> and https://docs.oracle.com/javase/8/docs/api/java/nio/charset/Charset.html).
> The documentation doesn't list which encodings are valid (I had to dig
> through the code to find the relevant handling class) and there's always
> the risk of a typo.
>
> I'm happy to spend some time on this if it is something that the core devs
> would find useful.
Looks like a good idea, especially since we already have commons-io on
the classpath.
If I read it correctly, it could be enough to use BOMInputStream and let
it automatically decide the encoding based on the presence of the BOM.
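
For illustration, a JDK-only sketch of that kind of BOM-based decision. This is an assumption about the general approach, not the BOMInputStream implementation; BomCharsetDetector and detect are hypothetical names, and UTF-32 BOMs are deliberately ignored:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomCharsetDetector {
    // Returns the charset implied by a leading BOM in the first bytes of a file,
    // or the configured fallback when no known BOM is present.
    // Caveat: a UTF-32LE BOM (FF FE 00 00) would be reported as UTF-16LE here.
    static Charset detect(byte[] head, Charset fallback) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return fallback;
    }

    // Compares the leading bytes of data against the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The fallback argument would be whatever encoding the user configured, so files without a BOM behave as they do today.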

Just open a bugzilla entry with an enhancement request and add a patch
to it.

Regards,
 Felix

> Regards
>
> Andrew
>

