Diacritics

From UA Libraries Digital Services Planning and Documentation

Revision as of 13:36, 23 October 2014

As metadata spreadsheets change hands and are often created from diverse sources, issues arise with diacritics. These characters often do not translate from one encoding to another, producing poor results in the resulting MODS metadata files.

Encoding problems appear as black rectangular three-letter "blocks" in Notepad++ or as diamond-shaped question marks in Archivist Utility. There are a few ways to deal with this.

Note:

  • Method one is the easiest but requires two programs (ExcelConverter and Notepad++)
  • Method two requires OpenOffice Calc, and it's a bit fiddly
  • Method three works fine and is close to our usual workflow (Excel and Notepad++), but it is way too labor-intensive for anything beyond a stray diacritic or two
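
Before choosing a method, it can help to confirm which lines of a text export are actually affected. The following is a minimal Python sketch, not part of the documented workflow. It assumes the export is meant to be UTF-8, and the function name find_encoding_problems is hypothetical. It flags lines that contain bytes that are not valid UTF-8 (a likely cause of the "blocks" in Notepad++) or the U+FFFD replacement character (the diamond-shaped question mark).

```python
from pathlib import Path

# U+FFFD is the character many tools display as a diamond-shaped question mark.
REPLACEMENT_CHAR = "\ufffd"


def find_encoding_problems(path):
    """Return (line_number, description) pairs for lines that look damaged.

    Assumes the file is supposed to be UTF-8 text.
    """
    problems = []
    raw = Path(path).read_bytes()
    for number, line in enumerate(raw.splitlines(), start=1):
        try:
            text = line.decode("utf-8")
        except UnicodeDecodeError as err:
            # The line contains bytes from some other encoding.
            problems.append((number, f"invalid UTF-8 byte at offset {err.start}"))
            continue
        if REPLACEMENT_CHAR in text:
            # An earlier conversion already lost the original character.
            problems.append((number, "contains U+FFFD replacement character"))
    return problems
```

If the returned list is empty, the export is at least valid UTF-8; a line reported with U+FFFD has already lost its original character and needs to be fixed in the source spreadsheet.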

Repair Method One: Use ExcelConverter

  1. Run the ExcelConverter script, choose the input file and export location, and click the Convert File! button -- the script defaults to exporting as Unicode
  2. Once the file has exported, open it in Notepad++
  3. From the Encoding menu, select Encode in UTF-8 without BOM
  4. Save and close
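
Steps 2 through 4 amount to re-encoding the exported file as UTF-8 without a byte order mark. As a rough illustration only, the same conversion can be sketched in Python. This assumes that "unicode" here means UTF-16, as with Excel's Unicode Text format; the ExcelConverter output has not been checked against that assumption, and the function name utf16_to_utf8 is hypothetical.

```python
def utf16_to_utf8(src, dst):
    """Re-encode a UTF-16 text file as UTF-8 without a BOM.

    Assumes src is UTF-16; Python's "utf-16" codec reads and strips
    the byte order mark if one is present.
    """
    # newline="" keeps the original line endings untouched.
    with open(src, "r", encoding="utf-16", newline="") as infile:
        text = infile.read()
    # The plain "utf-8" codec never writes a BOM ("utf-8-sig" would).
    with open(dst, "w", encoding="utf-8", newline="") as outfile:
        outfile.write(text)
```

For example, utf16_to_utf8("export.txt", "export_utf8.txt") would write a new file and leave the original untouched; both file names are placeholders.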

Repair Method Two: Use OpenOffice Calc

  1. Open the file in OpenOffice Calc
  2. From the File menu, select Save As...
  3. In the Save dialog window
    • Uncheck Automatic file name extension
    • Check Edit filter settings
    • Change Save as type to Text CSV
    • Manually change the extension in the File name box from .csv to .txt
    • Click Save
  4. In the Export Text File window
    • Change Character set to Unicode (UTF-8)
    • Leave Field delimiter as {Tab}
    • Make Text delimiter blank (you'll have to backspace over it manually)
    • Don't change the check boxes
    • Click OK
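
The export settings above describe a specific file shape: UTF-8 text, tab as the field delimiter, and no text delimiter (no quotes around fields). The sketch below writes rows in that shape from Python. It is an illustration under stated assumptions rather than a replacement for the OpenOffice steps: the function name write_tab_delimited is hypothetical, every field is assumed to be a string, and Windows line endings are assumed.

```python
def write_tab_delimited(rows, path):
    """Write rows as UTF-8, tab-delimited text with no quoting.

    rows is an iterable of lists of strings. Because there is no text
    delimiter, a field that itself contains a tab or a line break would
    corrupt the layout, so such fields are rejected.
    """
    with open(path, "w", encoding="utf-8", newline="") as out:
        for row in rows:
            for field in row:
                if "\t" in field or "\n" in field or "\r" in field:
                    raise ValueError(f"field contains a tab or line break: {field!r}")
            # Assumed Windows-style line endings.
            out.write("\t".join(row) + "\r\n")
```

Rejecting fields with embedded tabs or line breaks is a design choice: with the text delimiter left blank, there is no way to escape them in the output.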

Repair Method Three: Use Excel

  1. Open the file in Excel
  2. Use Excel's built-in character map to replace each problem character found in the Excel file
    • To access the character map, follow this path: Insert tab - Symbols group - Symbol
    • Select the character you need to replace and make sure "Unicode Hex" is selected in the dropdown
    • Insert the character
  3. Export the metadata as a Unicode text file
  4. Once the file has exported, open it in Notepad++
    • From the Encoding menu, select Encode in UTF-8 without BOM
    • Save and close
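
When using the Symbol dialog with "Unicode Hex" selected, it helps to know the hex code of each character you need. The following Python sketch lists the non-ASCII characters in a piece of text along with their Unicode hex codes and names. It is a convenience aid only, not part of the documented workflow, and the function name describe_non_ascii is hypothetical.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (character, hex code, Unicode name) for each distinct
    non-ASCII character in text, in order of first appearance."""
    seen = []
    for char in text:
        if ord(char) > 127 and char not in seen:
            seen.append(char)
    return [
        # Four-digit uppercase hex matches the codes shown in the Symbol dialog.
        (char, f"{ord(char):04X}", unicodedata.name(char, "UNKNOWN"))
        for char in seen
    ]


for char, code, name in describe_non_ascii("Libert\u00e9, \u00e9galit\u00e9 \u00e0 tous"):
    print(char, code, name)
# prints:
# é 00E9 LATIN SMALL LETTER E WITH ACUTE
# à 00E0 LATIN SMALL LETTER A WITH GRAVE
```

The hex code in the second column is the value to type into the Character code box of the Symbol dialog.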