主題: R讀取中文檔案產生亂碼等錯誤問題
說明:
# 2022.08.16 更新
感謝R友-HJ提供此問題.
問題: 使用 Windows 10 企業版, 匯入資料有亂碼或是有問題.
分析:
# Windows 10 企業版
# R-4.2.1
# RStudio Desktop 2022.07.1+554
# gfc.csv download
# https://github.com/rwepa/DataDemo/blob/master/gfc.csv
Sys.getlocale()
# "LC_COLLATE=Chinese (Traditional)_Taiwan.1252 ...
# method 1 default locale
gfc1 <- read.table("gfc.csv", header = TRUE, sep = ",")
head(gfc1, n=3)
# 亂碼
# ï..orderdate supplier amount
# 1 2009/1/3 日亞 266
# 2 2009/1/4 廣é\u0081” 123
# 3 2009/1/5 廣é\u0081” 66
# method 2 set locale
Sys.setlocale(category="LC_ALL", locale = "English_United States.1252")
Sys.getlocale()
# "LC_COLLATE=English_United States.125 ...
gfc2 <- read.table("gfc.csv", header = TRUE, sep = ",", encoding = "UTF-8")
head(gfc2, n=3)
# 亂碼
# X.U.FEFF.orderdate supplier amount
# 1 2009/1/3 <U+65E5><U+4E9E> 266
# 2 2009/1/4 <U+5EE3><U+9054> 123
# 3 2009/1/5 <U+5EE3><U+9054> 66
head(as.character(gfc2$supplier)) # 中文正常
# "日亞" "廣達" "廣達" "日亞" "廣達" "科銳"
# end
# 2022.5.4 更新
R-4.2.0版本直接支援CSV檔案為 UTF-8-BOM 編碼,
將檔案另存為ANSI編碼時, read.table 會有錯誤
myfile <- "gfc.csv"
gfc <- read.table(myfile, header = TRUE, sep = ",")
head(gfc)
# 2021.11.03 更新
# 直接讀取網路中文檔案
urls <- "https://raw.githubusercontent.com/rwepa/DataDemo/master/gfc.csv"
gfc <- read.table(urls, header = TRUE, sep = ",", fileEncoding = "UTF-8-BOM")
head(gfc)
# read.table
# encoding="UTF-8-BOM"
# ANSI
- 感謝R友-阿賢提供 encoding="UTF-8-BOM"解決亂碼問題.
- 使用R讀取文字檔時, 有時會遇到資料匯入有錯誤訊息或中文亂碼問題.
- 資料來源: https://data.gov.tw/dataset/35131, 匯入 open data 空氣品質監測小時值(一般污染物,每日更新) 所產生的問題與解決方式.
- 匯入資料 read.table {utils} 常用參數:
(1). fill = TRUE --> 使用時機: 錯誤訊息為 line x did not have xxx elements.
(2). encoding --> 結果為亂碼.
(3). fileEncoding --> 結果為亂碼. - 考慮 Windows 執行環境, 如果有亂碼問題, 最簡單的解決方式之一是使用記事本開啟檔案, 另存新檔 畫面中, 編碼改為 ANSI.
- 使用 Notepad++ [https://notepad-plus-plus.org/zh/] 開啟檔案, 在視窗右下角狀態列會有"UTF-8-BOM"編碼, 此時可加上 encoding 或 fileEncoding 參數, 本例使用 fileEncoding = "UTF-8-BOM" 參數即可完成匯入資料.
- 另可參考: 資料集為CSV檔,打開來為亂碼,怎麼辦? https://data.gov.tw/node/18765
- 結論: 使用另存ANSI編碼或加入fileEncoding="UTF-8-BOM"參數應該可以解決亂碼問題.
# R程式碼:
x1 <- read.table("ATM00626_20170910170405.csv", header=TRUE, sep=",") # line 1 did not have 31 elements
x2 <- read.table("ATM00626_20170910170405.csv", header=TRUE, sep=",", encoding="UTF-8") # line 1 did not have 31 elements
x3 <- read.table("ATM00626_20170910170405.csv", header=TRUE, sep=",", encoding="UTF-8", fill=TRUE) # 欄位錯誤
x3[1:3, 1:6]
x4 <- read.table("ATM00626_20170910170405.csv", header=TRUE, sep=",", encoding="UTF-8-BOM") # line 1 did not have 31 elements
x5 <- read.table("ATM00626_20170910170405.csv", header=TRUE, sep=",", encoding="UTF-8-BOM", fill=TRUE) # 亂碼
x5[1:3, 1:6]
x6 <- read.table("ATM00626_20170910170405.csv", header=TRUE, sep=",", fileEncoding = "UTF-8-BOM", fill=TRUE) # OK
x6[1:3, 1:6]
x7 <- read.table("ATM00626_20170829185638-ansi.csv", header=TRUE, sep=",") # OK
x7[1:3, 1:6]
x8 <- read.csv("ATM00626_20170910170405.csv", header=TRUE) # 亂碼
x8[1:3, 1:6]
x9 <- read.csv("ATM00626_20170910170405.csv", header=TRUE, fileEncoding = "UTF-8") # invalid input found
x10 <- read.csv("ATM00626_20170910170405.csv", header=TRUE, encoding = "UTF-8-BOM") # 亂碼
x10[1:3, 1:6]
x11 <- read.csv("ATM00626_20170910170405.csv", header=TRUE, fileEncoding = "UTF-8-BOM") # OK
x11[1:3, 1:6]
# end
沒有留言:
張貼留言