Diacritics

From UA Libraries Digital Services Planning and Documentation

Revision as of 12:03, 19 February 2013

As metadata spreadsheets change hands, and are often compiled from diverse sources, problems arise with diacritics. These characters often do not translate from one encoding to another, which produces poor results in the resulting MODS metadata files.

Encoding problems appear as black, rectangular, three-letter "blocks" in Notepad++ or as diamond-shaped question marks in Archivist Utility.
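
To see where such symbols come from, here is a minimal Python sketch. It is not part of the workflow on this page, and the sample word is only an illustration. It saves a French word in the Windows-1252 ("ANSI") encoding and then reads the bytes back as UTF-8. The byte for "é" is not valid UTF-8 on its own, so the decoder substitutes the replacement character U+FFFD, which many programs draw as a diamond-shaped question mark.

```python
# Illustrative only: the sample text is an assumption, not collection data.
sample = "Révolution"

# Bytes as a Windows-1252 ("ANSI") program would save them.
ansi_bytes = sample.encode("cp1252")

# Reading those bytes as UTF-8 fails on the byte for "é" (0xE9);
# errors="replace" substitutes U+FFFD instead of raising an error.
misread = ansi_bytes.decode("utf-8", errors="replace")

print(misread)
print("damaged" if "\ufffd" in misread else "no damage found")
```

The same mismatch, in either direction, is what the steps below are meant to find and repair.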

Based on our experience with French-language diacritics, the following process allowed MODS to be created without the numerous encoding problems that we initially encountered with the French Revolutionary Pamphlets (u0002_0000006).

  To do these steps, you will need a Windows computer with Microsoft Excel and Notepad++ (http://notepad-plus.sourceforge.net/).
  You will also need the TextFX plugin for Notepad++.
  To get plugins for Notepad++, refer to the application documentation on the Plugin Manager.

Finding Diacritics

This method uses a tab-delimited text export of the metadata spreadsheet. Note: If the Excel file is in the legacy .xls format, convert it to .xlsx before exporting.

  1. Open the exported text file in Notepad++ and choose "Encode in UTF-8".
  2. Use the TextFX plugin and choose: TextFX Characters > zap all non-printable characters to #
  3. Search for all instances of "#".
  4. If any "#" appears, it marks a character that did not survive the encoding change, typically a diacritic. Keep this text export open as a guide for the next step, Repairing Diacritics.
  5. If no "#" appears, continue on to Making MODS.
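
For readers who want a scripted cross-check, the following Python sketch looks for the same kind of problem spots. It is an alternative to the TextFX step, not a description of what TextFX does internally, and the function name and sample data are hypothetical. It reports the line and column of every byte that is not valid UTF-8, and also lists valid non-ASCII characters so they can be reviewed.

```python
def find_suspect_characters(raw: bytes):
    """Return (line_number, column, description) for each spot worth checking."""
    hits = []
    # Splitting on the newline byte is safe: 0x0A never occurs inside
    # a multi-byte UTF-8 sequence.
    for line_no, line in enumerate(raw.split(b"\n"), start=1):
        # Invalid bytes become U+FFFD instead of raising an error.
        text = line.decode("utf-8", errors="replace")
        for col, ch in enumerate(text, start=1):
            if ch == "\ufffd":
                hits.append((line_no, col, "invalid UTF-8 byte"))
            elif ord(ch) > 127:
                hits.append((line_no, col, f"non-ASCII character {ch!r}"))
    return hits


# Hypothetical tab-delimited export in which "é" was saved as Windows-1252.
# For a real file, read the bytes with open(path, "rb").read() instead.
sample_export = b"Title\tDate\nD\xe9cret\t1790\nRapport\t1791\n"

for line_no, col, what in find_suspect_characters(sample_export):
    print(f"line {line_no}, column {col}: {what}")
```

Columns are counted in characters after decoding, so they are a guide for locating the cell in the spreadsheet rather than exact byte offsets.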


Repairing Diacritics

  1. Open the Excel spreadsheet for the collection (or collection batch).
  2. Use Excel's built-in character map to replace every problem character found in the previous section.
    1. To access the character map, follow this path: Insert tab - Symbols group - Symbol
    2. Select the correct character to put in place of the damaged one, and make sure "Unicode Hex" is selected.
    3. Insert the character.
  3. Close the text export and delete it (DO NOT CLOSE/DELETE THE SPREADSHEET).
  4. Export the metadata as a new Unicode text file.
  5. Continue on (or go back) to Making MODS.
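
As an optional sanity check on the new export, the sketch below decodes the file strictly and fails loudly if anything is still damaged. It assumes the export carries a UTF-16 byte order mark, which is how Excel's "Unicode Text" format is normally written, and otherwise falls back to UTF-8. The function name is hypothetical, and whether later steps need any further conversion is not covered here.

```python
import codecs


def decode_unicode_export(raw: bytes) -> str:
    """Decode a Unicode text export strictly; raise if it is still damaged."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        encoding = "utf-16"      # the BOM tells the codec the byte order
    else:
        encoding = "utf-8-sig"   # UTF-8, with or without a BOM
    # Strict decoding raises UnicodeDecodeError on any invalid byte.
    text = raw.decode(encoding)
    if "\ufffd" in text:
        raise ValueError("export already contains U+FFFD replacement characters")
    return text


# Hypothetical one-row export; encoding to "utf-16" in Python adds a BOM.
clean = decode_unicode_export("Décret\t1790\n".encode("utf-16"))
print(clean)
```

If the function raises an error, the spreadsheet still holds at least one character that needs the repair steps above.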


This Excel file (http://intranet.lib.ua.edu/wiki/digcoll/images/e/e6/French_encoding_map.xlsx) may be a useful starting point for searching and replacing diacritics in French-language metadata.
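
The contents of that file are not reproduced here. As a rough illustration of what a search-and-replace map can look like, the Python sketch below derives one for a single common kind of damage: UTF-8 text that was misread as Windows-1252, which turns "é" into "Ã©". The character list and function names are assumptions made for this example, and other kinds of damage would need a different map.

```python
# Lowercase French diacritics chosen for illustration; extend as needed.
FRENCH_DIACRITICS = "àâçéèêëîïôùûü"


def build_repair_map(characters: str) -> dict:
    """Map each garbled form (UTF-8 bytes read as Windows-1252) to its character."""
    repair = {}
    for ch in characters:
        try:
            garbled = ch.encode("utf-8").decode("cp1252")
        except UnicodeDecodeError:
            # Some bytes have no Windows-1252 meaning; skip those characters.
            continue
        repair[garbled] = ch
    return repair


def repair_text(text: str, repair: dict) -> str:
    """Apply every search-and-replace pair in the map to the text."""
    for garbled, correct in repair.items():
        text = text.replace(garbled, correct)
    return text


repair_map = build_repair_map(FRENCH_DIACRITICS)
print(repair_text("DÃ©cret de lâ€™AssemblÃ©e", repair_map))
```

Characters outside the list, such as the damaged apostrophe in the sample string, are left untouched, which makes it easy to see what still needs a manual fix in Excel.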
