Diacritics

From UA Libraries Digital Services Planning and Documentation
Jump to: navigation, search

As metadata spreadsheets exchange hands and are often even created from diverse sources, issues arise regarding diacritics. These characters often do no translate from encoding to encoding, creating poor results in the resultant MODS metadata files.

Encoding problems will appear as black rectangular three letter "blocks" in Notepad++ or diamond shaped question marks in Archivist Utility. There are a few ways to deal with this

Note:

  • Method one is the easiest but requires two programs (ExcelConverter and Notepad++)
  • Method two requires OpenOffice Calc, and it's a bit fiddly
  • Method three works fine and is close to our usual workflow (Excel and Notepad++), but it is way too labor-intensive for anything beyond a stray diacritic or two

Repair Method One: Use ExcelConverter

  1. Run the ExcelConverter script, choose the input file and export location, and click the Convert File! button -- the script defaults to exporting as unicode
  2. Once the file has exported, open it in Notepad++
    • From the Encoding menu, select Encode in UTF-8 without BOM
    • Save and close

Repair Method Two: Use OpenOffice Calc

  1. Open the file in OpenOffice Calc
  2. From the File menu, select Save As...
  3. In the Save dialog window
    • Uncheck Automatic file name extension
    • Check Edit filter settings
    • Change Save as type to Text CSV
    • Manually change the extension in the File name box from .csv to .txt
    • Click Save
  4. In the Export Text File window
    • Change Character set to Unicode (UTF-8)
    • Leave Field delimiter as {Tab}
    • Make Text delimiter blank (you'll have to backspace over it manually)
    • Don't change the check boxes
    • Click OK

CalcForDiacritics.png

Repair Method Three: Use Excel

  1. Open the file in Excel
  2. Use Excel's built in character map to replace all found problems in the Excel file
    • To access the character map, follow this path: Insert tab - Symbols group - Symbol
    • Select the character you need to replace and make sure "Unicode Hex" is selected in the dropdown
    • Insert the character
  3. Export the metadata as a Unicode text file
  4. Once the file has exported, open it in Notepad++
    • From the Encoding menu, select Encode in UTF-8 without BOM
    • Save and close
Personal tools