Diacritics

From UA Libraries Digital Services Planning and Documentation

Latest revision as of 14:45, 23 October 2014

As metadata spreadsheets change hands and are often compiled from diverse sources, problems with diacritics arise. These characters often do not translate cleanly from one encoding to another, producing poor results in the resulting MODS metadata files.

Encoding problems appear as black rectangular three-letter "blocks" in Notepad++ or as diamond-shaped question marks in Archivist Utility. There are a few ways to deal with this.

Note:

  • Method one is the easiest but requires two programs (ExcelConverter and Notepad++)
  • Method two requires OpenOffice Calc, and it's a bit fiddly
  • Method three works fine and is close to our usual workflow (Excel and Notepad++), but it is way too labor-intensive for anything beyond a stray diacritic or two
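
Before choosing a method, it can help to confirm which lines of a text export are actually damaged. The following is a minimal Python sketch, not part of the documented workflow. It assumes the file is supposed to be UTF-8, and the function name find_encoding_problems is made up for illustration. It flags lines that are not valid UTF-8 and lines that contain U+FFFD, the replacement character that many programs draw as a diamond-shaped question mark.

```python
from pathlib import Path

REPLACEMENT_CHAR = "\ufffd"  # often drawn as a diamond-shaped question mark


def find_encoding_problems(path):
    """Return (line_number, description) pairs for lines that look damaged.

    Hypothetical helper: assumes the file is meant to be UTF-8.
    """
    problems = []
    raw = Path(path).read_bytes()
    for number, line in enumerate(raw.split(b"\n"), start=1):
        try:
            text = line.decode("utf-8")
        except UnicodeDecodeError as error:
            problems.append((number, f"not valid UTF-8 at byte {error.start}"))
            continue
        if REPLACEMENT_CHAR in text:
            problems.append((number, "contains U+FFFD replacement character"))
    return problems
```

An empty result means every line decoded as UTF-8 and no replacement characters were found; it does not prove that each diacritic is the correct one.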

Repair Method One: Use ExcelConverter

  1. Run the ExcelConverter script, choose the input file and export location, and click the Convert File! button. The script exports as Unicode by default.
  2. Once the file has exported, open it in Notepad++
    • From the Encoding menu, select Encode in UTF-8 without BOM
    • Save and close
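
The Notepad++ re-encoding step can also be scripted. This is a rough Python sketch under one assumption: that the exported file is UTF-16 with a byte order mark, which is what Excel's Unicode Text format normally produces. Whether ExcelConverter writes the same thing is not confirmed here. The function name utf16_to_utf8_no_bom is hypothetical.

```python
from pathlib import Path


def utf16_to_utf8_no_bom(source, target):
    """Re-encode a UTF-16 text file as UTF-8 without a byte order mark."""
    # The "utf-16" codec reads the BOM to pick the byte order, then drops it.
    text = Path(source).read_bytes().decode("utf-16")
    # Plain "utf-8" (unlike "utf-8-sig") never writes a BOM.
    Path(target).write_bytes(text.encode("utf-8"))
```

Working on bytes rather than text-mode files keeps the line endings exactly as they were exported.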

Repair Method Two: Use OpenOffice Calc

  1. Open the file in OpenOffice Calc
  2. From the File menu, select Save As...
  3. In the Save dialog window
    • Uncheck Automatic file name extension
    • Check Edit filter settings
    • Change Save as type to Text CSV
    • Manually change the extension in the File name box from .csv to .txt
    • Click Save
  4. In the Export Text File window
    • Change Character set to Unicode (UTF-8)
    • Leave Field delimiter as {Tab}
    • Make Text delimiter blank (you'll have to backspace over it manually)
    • Don't change the check boxes
    • Click OK

CalcForDiacritics.png
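
Because this method involves several dialog settings, a quick check of the saved file can catch a missed step. The sketch below is an illustration only, and check_export is a hypothetical name. It tests the three things the settings above are meant to guarantee: UTF-8 encoding, tab-delimited columns of a consistent width, and no text delimiter (assumed here to be the double quote, Calc's usual default) around cells.

```python
from pathlib import Path

UTF8_BOM = b"\xef\xbb\xbf"


def check_export(path):
    """Return a list of human-readable problems found in an exported file."""
    raw = Path(path).read_bytes()
    issues = []
    if raw.startswith(UTF8_BOM):
        issues.append("file starts with a UTF-8 byte order mark")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return issues + ["file is not valid UTF-8"]
    # Split non-empty lines into tab-separated cells.
    rows = [line.split("\t") for line in text.splitlines() if line]
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        issues.append(f"rows have differing column counts: {sorted(widths)}")
    quoted = any(
        len(cell) > 1 and cell.startswith('"') and cell.endswith('"')
        for row in rows
        for cell in row
    )
    if quoted:
        issues.append("some cells are wrapped in double quotes")
    return issues
```

An empty list means none of these checks failed. A cell that legitimately contains a line break or begins and ends with a quotation mark would be reported as a false positive.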

Repair Method Three: Use Excel

  1. Open the file in Excel
  2. Use Excel's built-in character map to replace each problem character in the Excel file
    • To access the character map, follow this path: Insert tab - Symbols group - Symbol
    • Select the correct replacement character and make sure "Unicode Hex" is selected in the dropdown
    • Insert the character
  3. Export the metadata as a Unicode text file
  4. Once the file has exported, open it in Notepad++
    • From the Encoding menu, select Encode in UTF-8 without BOM
    • Save and close
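
When using the Symbol window in step 2, it helps to know the Unicode hex code of the character you are looking for. This small Python sketch prints the code and official Unicode name for a few characters. The sample characters are arbitrary French examples chosen for illustration, and describe_character is a hypothetical helper name.

```python
import unicodedata


def describe_character(char):
    """Return the four-digit Unicode hex code and official name of a character."""
    code = format(ord(char), "04X")
    name = unicodedata.name(char, "UNKNOWN")
    return code, name


# Print one line per sample character: the character, its code, its name.
for char in "éàçÉ":
    code, name = describe_character(char)
    print(f"{char}\tU+{code}\t{name}")
```

The hex code is the value the Symbol window shows when its dropdown is set to Unicode hex.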