Diacritics

From UA Libraries Digital Services Planning and Documentation
Latest revision as of 13:03, 19 February 2013

As metadata spreadsheets change hands, and are often created from diverse sources, issues arise with diacritics. These characters often do not translate from one encoding to another, producing poor results in the resulting MODS metadata files.

Encoding problems appear as black, rectangular, three-letter "blocks" in Notepad++ or as diamond-shaped question marks in Archivist Utility.
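
To illustrate why these symptoms occur, here is a minimal Python sketch. It assumes the spreadsheet text was saved in the Windows-1252 ("ANSI") encoding and then read as UTF-8; the sample word is made up for illustration and is not taken from the collection.

```python
# Illustration only: a Windows-1252 ("ANSI") byte sequence misread as UTF-8.
text = "Révolution"

# In Windows-1252, the character é is stored as the single byte 0xE9.
ansi_bytes = text.encode("cp1252")

# 0xE9 on its own is not valid UTF-8, so a UTF-8 reader cannot decode it.
# With errors="replace", Python substitutes U+FFFD (the replacement character),
# which many programs draw as a question mark inside a diamond.
misread = ansi_bytes.decode("utf-8", errors="replace")

print(repr(misread))
```

The same invalid byte is what an editor may show as a small block instead of the accented letter.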

Based on our experience with French-language diacritics, the following process allowed MODS to be created without the numerous encoding problems that we initially encountered with the French Revolutionary Pamphlets collection (u0002_0000006).

  To do these steps, you will need a Windows computer with Microsoft Excel and Notepad++ (http://notepad-plus.sourceforge.net/).
  You will also need the TextFX plugin for Notepad++.
  To get plugins for Notepad++, refer to the application documentation on the Plugin Manager.

Finding Diacritics

This method uses a tab-delimited text export of the metadata spreadsheet. Note: If the Excel file is in the legacy .xls format, convert it to .xlsx before exporting.

  1. Open the file in Notepad++ and choose "Encode in UTF-8".
  2. Use the TextFX plugin and choose: TextFX Characters > zap all non-printable characters to #
  3. Search for all instances of "#".
  4. If this character shows up, it represents a diacritic. Keep this text export open as a guide for the next section, Repairing Diacritics.
  5. If this character does not show up, continue on to Making MODS.
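
As a rough cross-check of the steps above, the same search can be sketched in Python. This is not part of the documented workflow; the function name `find_non_ascii` and the sample rows are hypothetical, and the sketch simply reports every character outside the plain ASCII range in a tab-delimited text.

```python
import csv
import io


def find_non_ascii(tsv_text):
    """Return (row, column, character) for each non-ASCII character found.

    Rows and columns are numbered from 1 so they match the spreadsheet.
    """
    hits = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        for column_number, cell in enumerate(row, start=1):
            for ch in cell:
                if ord(ch) > 127:  # anything beyond ASCII, e.g. é, è, ç
                    hits.append((row_number, column_number, ch))
    return hits


# Hypothetical sample: a header row followed by one metadata row.
sample = "Title\tCreator\nDéclaration des droits\tAssemblée nationale\n"
hits = find_non_ascii(sample)
for row_number, column_number, ch in hits:
    print(f"row {row_number}, column {column_number}: {ch!r}")
```

To run this against a real export, read the file into a string first, choosing the encoding the export was actually saved in (for example `open(path, encoding="cp1252")` for an ANSI export, which is an assumption about your file).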


Repairing Diacritics

  1. Open the Excel spreadsheet for the collection (or collection batch).
  2. Use Excel's built-in character map to replace all of the problems found in the Excel file.
    1. To access the character map, follow this path: Insert tab - Symbols group - Symbol
    2. Select the correct character to insert as the replacement and make sure "Unicode Hex" is selected.
    3. Insert the character.
  3. Close the text export and delete it (DO NOT CLOSE/DELETE THE SPREADSHEET).
  4. Export the metadata as a new Unicode text file.
  5. Continue on (or go back) to Making MODS.
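
If a later tool expects UTF-8 rather than Excel's Unicode export, the conversion can be sketched as below. This is a hedged illustration, not the documented procedure: it assumes Excel's "Unicode Text" export is UTF-16 with a byte order mark, and the function name `utf16_export_to_utf8` and the sample text are hypothetical.

```python
def utf16_export_to_utf8(raw):
    """Convert the bytes of a UTF-16 text export to UTF-8 bytes.

    Assumes the export begins with a byte order mark, which Python's
    "utf-16" codec uses to detect the byte order and then discards.
    """
    text = raw.decode("utf-16")
    if "\ufffd" in text:
        # U+FFFD means a character was already lost before the export.
        raise ValueError("replacement character found; repair the spreadsheet first")
    return text.encode("utf-8")


# Hypothetical sample standing in for the contents of an exported file.
raw = "Assemblée\tnationale\r\n".encode("utf-16")
converted = utf16_export_to_utf8(raw)
print(converted)
```

Raising an error on U+FFFD is a design choice for this sketch: it stops the conversion so that a damaged character is fixed in the spreadsheet rather than carried into the MODS files.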


This Excel file (http://intranet.lib.ua.edu/wiki/digcoll/images/e/e6/French_encoding_map.xlsx) may be useful as a starting point for searching and replacing diacritics in French-language metadata.
